Comparison of human and chimpanzee genomes has received much attention because of its
paramount role in understanding the evolutionary steps distinguishing us from our closest
living relative. In order to contribute to insight into Y chromosome evolutionary history,
we study and compare tandems, higher order repeats (HORs), and regularly dispersed
repeats in human and chimpanzee Y chromosome contigs, using the robust Global Repeat
Map algorithm. We find a new type of long-range acceleration, human-accelerated HOR
regions. In peripheral domains of 35mer human alphoid HORs, we find riddled features
with ten additional repeat monomers. In chimpanzee, we identify a 30mer alphoid HOR.
We construct alphoid HOR schemes showing significant human-chimpanzee difference,
revealing rapid evolution after human-chimpanzee separation. We identify and analyze
over 20 large repeat units, most of them reported here for the first time: chimpanzee
and human ~1.6 kb 3mer secondary repeat unit (SRU) and ~23.5 kb tertiary repeat unit
(~0.55 kb primary repeat unit, PRU); human 10848, 15775, 20309, 60910, and 72140
bp PRUs; human 3mer SRU (~2.4 kb PRU); 715mer and 1123mer SRUs (5mer PRU);
chimpanzee 5096, 10762, 10853, 60523 bp PRUs; and chimpanzee 64624 bp SRU (10853 bp
PRU). We show that substantial human-chimpanzee differences are concentrated in large
repeat structures, at the level of as much as ~70% divergence, sizably exceeding previous
numerical estimates for some selected noncoding sequences. Smeared over the whole
sequenced assembly (25 Mb), this gives ~14% human-chimpanzee divergence, a
significantly higher estimate of divergence between human and chimpanzee than previous
estimates.
I. INTRODUCTION

A. Atypical Structure of Human Y Chromosome 

One of the challenging problems in genomics is related to the evolutionary
development of the Y chromosome. The Y chromosome has a unique role in human
population genetics, with properties that distinguish it from all other
chromosomes (Jobling and Tyler-Smith 2003; [...] et al. 1985; Skaletsky et al.
2003). The prevailing view is that X and Y chromosomes evolved from a pair of
autosomes (Graves 1995; Lahn and Page 1999; Marshall Graves 2006; Muller 1914;
Ohno 1967). Lack of recombination between nonrecombining parts of X and Y
chromosomes was thought to be responsible for decay of the Y-linked genes, the
pace of which slows over time, eventually leading to a paucity of genes.
Identification of distinct palindromes harboring several distinct gene families
unique to the long arm of the Y chromosome, frequent gene conversion, and
multiplication have raised some doubt about progressive decay of the Y
chromosome (Ali and Hasnain 2003; de Knijff 2006; Kuroda-Kawaguchi et al. 2001;
Rozen et al. 2003; Skaletsky et al. 2003). It was shown that the Y chromosome
has acquired a large number of testis-specific genes during the course of
evolution, including those essential for spermatogenesis (Saxena et al. 1996;
Silber and Repping 2002; Skaletsky et al. 2003).



Considerations of the atypical structure of the human Y chromosome have largely
focused on the gene-related content. However, the human Y chromosome is also
replete with many pronounced repetitive sequences, and multicopy gene arrays are
embedded in palindromes (Cooper et al. 1993a; Kirsch et al. 2008; Oakey and
Tyler-Smith 1990; Perry et al. 2007; Rozen et al. 2003; Skaletsky et al. 2003;
Tyler-Smith 1985; Tyler-Smith and Brown 1987; Wolfe et al. 1985).

B. Alphoid Higher Order Repeats

Alphoid arrays in centromeres of human and other mammal chromosomes consist of
tandem repeats of AT-rich alpha satellites (Alexandrov et al. 2001; Choo 1997;
Maio 1971; Manuelidis and Wu 1978; Mitchell et al. 1985; Romanova et al. 1996;
Rudd et al. 2006; Tyler-Smith 1985; Tyler-Smith and Brown 1987; Warburton and
Willard 1996; Warburton et al. 1996; Waye and Willard 1987; Willard 1985).
Stretches of alpha satellites lacking any higher-order periodicity mutually
diverge by ~20-35% and are referred to as monomeric (Warburton and Willard
1996).

Higher order repeats (HORs) are defined as a higher order periodicity pattern
superimposed on the approximately periodic tandem of alpha monomers: if an array
of n monomers denoted by 1, 2, ..., n is followed by the next array of monomers
denoted by n + 1, n + 2, ..., 2n, where the monomer 1 is almost identical (more
than 95%) to the monomer n + 1, the monomer 2 to the monomer n + 2, and so on up
to the monomer n and the monomer 2n, these arrays belong to the nmer HOR
(Warburton and Willard 1996). The HOR copies from the same locus diverge from
each other by < 5%, while the alpha satellite copies within any HOR copy diverge
from each other by ~20-35%.

Alphoid HORs are chromosome-specific (Choo 1997; Haaf and Willard 1992;
Jorgensen et al. 1986; Warburton and Willard 1996; Willard 1985; Willard and
Waye 1987). A type of polymorphism found in alphoid arrays is related to HOR
units that differ by an integral number of monomers (monomer insertion or
deletion) but are nonetheless closely related in sequence (Haaf and Willard
1992; Warburton and Willard 1996).

Investigations using restriction endonuclease digestion have revealed a major
block of alphoid DNA in the centromeric region of the human Y chromosome (Cooper
et al. 1993a, b; Mitchell et al. 1985; Tyler-Smith 1985; Tyler-Smith and Brown
1987; Wolfe et al. 1985). The size of this alphoid block was found to be
polymorphic, widely varying between different individuals (Oakey and
Tyler-Smith 1990; Tyler-Smith and Brown 1987). Initially, a 5.7 kb HOR unit was
reported as a major variant of secondary periodicity and a 6.0 kb HOR unit as a
minor variant. These HOR units were associated with a 34mer and a 36mer,
respectively (Tyler-Smith and Brown 1987). In a more recent study, a 5941 bp
secondary periodicity (35 alphoid repeat units) was reported (Skaletsky et al.
2003).

The alpha satellite DNA can be considered as a paradigm for processes of
concerted evolution in tandemly repeated DNA families (Warburton and Willard
1996; Willard and Waye 1987).

C. Bioinformatics Studies of Alphoid HORs

During the last decade, sequence contigs spanning the junction at the edges of
the centromere DNA array have become available for bioinformatics analyses
(Nusbaum et al. 2006; Paar et al. 2005, 2007; Rosandic et al. 2003a, b, 2006;
Ross et al. 2005; Rudd and Willard 2004; Rudd et al. 2003; Skaletsky et al.
2003). However, major gaps still remain at the centromeric region of
chromosomes (Henikoff 2002; Rudd and Willard 2004; Schueller et al. 2001).
Mostly, only peripheral HOR copies are accessible, at the edges of the
centromeric region. Previously, Rudd and Willard (2004) analyzed the Build 34
assembly, using a combination of BLAST (Altschul et al. 1990) and DOTTER
(Sonnhammer and Durbin 1995), and reported the presence of HORs. Recently, using
Tandem Repeat Finder (TRF) (Benson 1999) and other standard bioinformatics
tools, Gelfand et al. (2007) and Warburton et al. (2008) studied human HORs in
more detail.

In a different approach, we have shown that the Key String Algorithm (KSA) and
its extension, the Global Repeat Map (GRM), are effective in identification and
analysis of the intrinsic structure of HORs (Paar et al. 2005, 2007; Rosandic et
al. 2003a, b, 2006). Applying KSA and GRM to the NCBI human genome assembly, the
detailed structure of known and some new human alphoid HORs was determined.



D. Comparison of Human and Chimpanzee Genome 
Sequences 

To understand the genetic basis of unique human features, the human and
chimpanzee genomes have been compared in a number of studies (Bailey and Eichler
2006; Boffelli et al. 2003; Chen and Li 2001; Cheng et al. 2005; Ebersberger et
al. 2007; Fujiyama et al. 2002; Haaf and Willard 1997, 1998; Kehrer-Sawatzki and
Cooper 2007; Khaitovich et al. 2005; King and Wilson 1975; Kuroki et al. 2006;
Laursen et al. 1992; Liu et al. 2009; Mikkelsen et al. 2005; Newman et al. 2005;
Olson and Varki 2003; Patterson et al. 2006; Pennacchio and Rubin 2001; Perry et
al. 2008; Sibley and Ahlquist; Varki and Altheide 2005; Varki et al. 2008;
Watanabe et al. 2004; Webster et al. 2003). Large variation in sequence
divergence was often seen among genomic regions. For example, the last intron of
the ZFY gene showed only 0.69% divergence between human and chimpanzee (Dorit et
al. 1995), whereas for the OR1D3P pseudogene a divergence of 3.04% was found
(Glusman et al. 2000). Thus, to have reliable estimates of the average
divergences between hominoid genomes, it was concluded that sequence data from
many genomic regions are needed (Chen and Li 2001). Estimates of divergence due
to nucleotide substitutions were about 1.24% between selected intergenic
nonrepetitive DNA segments in humans and chimpanzees, substantially lower than
previous ones of about 3%, which included repetitive sequences (Chen and Li
2001; Ebersberger et al. 2002; Fujiyama et al. 2002; Mikkelsen et al. 2005). A
greater sequence divergence (1.78%) was obtained between the reported finished
sequence of the chimpanzee Y chromosome (PTRY) and the human Y chromosome
(Kuroki et al. 2006). Comparing the DNA sequences of unique, Y-linked genes in
chimpanzee and human, evidence was found that in the human lineage all such
genes were conserved, whereas in the chimpanzee lineage several genes have
sustained inactivating mutations (Hughes et al. 2005).

On the other hand, the overall sequence divergence, taking regions of indels
into account, was estimated to be approximately 5% (Britten 2002, 2003; Cheng et
al. 2005; Gibbs et al. 2007). In some short stretches of human and chimpanzee
genomes, so-called human-accelerated regions, a significant increase of
substitution divergence was found (Pollard 2009; Pollard et al. 2006a, b;
Popesco et al. 2006; Prabhakar et al. 2006). On the other hand, based on
phylogenetic analysis of a large number of DNA sequence alignments from human
and chimpanzee, it was found that for a sizeable fraction of our genome we share
no immediate genetic ancestry with chimpanzee (Ebersberger et al. 2007).



Experimental evidence suggests that a progenitor of suprachromosomal alphoid
family 3 was established and dispersed to chimpanzee chromosomes homologous to
human chromosomes 1, 11, 17 and X prior to the human-chimpanzee split (Baldini
et al. 1991; Durfy and Willard 1990; Warburton et al. 1996; Willard 1991).
Notably, the alphoid HOR organization in the X chromosome has been conserved
(Durfy and Willard 1990); only the localization of the suprachromosomal family
(SF) 3 alpha satellite is substantially conserved. It was concluded that the
lack of sequence or HOR conservation among human and chimpanzee indicates that
most alpha satellite sequences do not evolve orthologously.
In a recent publication, Hughes et al. (2010) have shown by sequence comparison
of human and chimpanzee MSY that humans and chimpanzees differ radically in
sequence structure and gene content. It was concluded that, since the separation
of human and chimpanzee lineages, sequence gain and loss have been far more
concentrated in the MSY than in the balance of the genome, indicating
accelerated structural remodeling of the MSY in the chimpanzee and human
lineages during the past six million years.

The previously reported 35mer alphoid HOR in the human Y chromosome (Skaletsky
et al. 2003; Tyler-Smith and Brown 1987; Warburton and Willard 1996) involves
the largest alphoid HOR unit found in the human genome, and it is of particular
interest to look for divergence between alphoid HORs in the human and chimpanzee
Y chromosomes. An alphoid HOR in the chimpanzee Y chromosome has not yet been
reported.



Bearing in mind the possibly important information regarding the evolutionary
role of human and chimpanzee Y chromosomes, the availability of their genomic
sequences (Mikkelsen et al. 2005; Skaletsky et al. 2003), and the demanding task
of studying such long HOR units bioinformatically, we perform here an extensive
study applying the novel robust bioinformatics tool GRM. We investigate the
major alphoid HOR from the Build 37.1 assembly of the human Y chromosome and
determine its detailed monomer scheme and consensus sequence, finding a riddling
pattern not reported previously. In the chimpanzee Y chromosome we identify and
analyze an alphoid HOR for the first time. We find that the human and chimpanzee
HORs are sizeably different, both in size and composition of HOR units and in
the constituent monomer structure.

Furthermore, we identify and investigate in human and 
chimpanzee Y chromosomes more than 20 other tandems, 
HORs and regularly dispersed repeats based on large re- 
peat units, showing sizeable human-chimpanzee diver- 
gence. Most of these repeats are reported here for the 
first time. 



II. MATERIALS AND METHODS 
A. Key String Algorithm 

In spite of powerful standard bioinformatics tools, it is still difficult to
identify and analyze large repeat units. For example, the detection limit of TRF
is 2 kb (Gelfand et al. 2007; Warburton et al. 2008). Here, we use a new
approach that is particularly useful for very long and/or complex repeats.

The KSA framework is based on the use of a freely chosen short sequence of
nucleotides, called a key string, which cuts a given genomic sequence at each
location of the key string within the genomic sequence. Going along the genomic
sequence, the lengths of the ensuing KSA fragments form the KSA length array.
Such an array can be compared to an array of lengths of restriction fragments
resulting from a hypothetical complete digestion, cutting the genomic sequence
at recognition sites corresponding to the KSA key string. Any periodicity
appearing in the KSA length array enables identification and location of a
repeat in a given genomic sequence. Analysis of repeat sequences at the position
of any periodicity in the KSA length array gives the consensus repeat unit and
the divergence of each repeat copy with respect to the consensus. Any higher
order periodicity in the KSA length array reveals the presence of a HOR at that
location and enables determination of the consensus HOR repeat unit and the
divergence of each HOR copy with respect to the consensus.
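The cutting and length-array idea described above can be illustrated with a minimal sketch. This is not the authors' KSA implementation: the function names, the toy sequence, the key string "GATC", and the exact-match periodicity test (with an optional tolerance) are all assumptions made for illustration.

```python
def ksa_cut_positions(sequence: str, key: str) -> list[int]:
    """Return every start position of `key` in `sequence` (overlaps allowed)."""
    positions = []
    pos = sequence.find(key)
    while pos != -1:
        positions.append(pos)
        pos = sequence.find(key, pos + 1)
    return positions


def ksa_length_array(sequence: str, key: str) -> list[int]:
    """Lengths of the fragments between successive key-string cut sites."""
    cuts = ksa_cut_positions(sequence, key)
    return [b - a for a, b in zip(cuts, cuts[1:])]


def length_array_period(lengths: list[int], tolerance: int = 0):
    """Smallest p such that lengths[i] is within `tolerance` of lengths[i + p]
    for every i; returns None if no such p exists."""
    for p in range(1, len(lengths) // 2 + 1):
        if all(abs(lengths[i] - lengths[i + p]) <= tolerance
               for i in range(len(lengths) - p)):
            return p
    return None


# Hypothetical toy array: three "monomers" of different length, each starting
# with the key string, repeated four times (a 3mer higher order unit).
m1, m2, m3 = "GATCAAAAAA", "GATCAAATTTT", "GATCCCCCCGGG"
sequence = (m1 + m2 + m3) * 4

lengths = ksa_length_array(sequence, "GATC")
period = length_array_period(lengths)
print(lengths)                 # fragment lengths of 10, 11 and 12 bp in turn
print(period)                  # monomers per higher order unit
print(sum(lengths[:period]))   # length of the higher order unit in bp
```

In this toy case the length array repeats with period three, which is the kind of higher order periodicity the text associates with a HOR; on real sequences a nonzero tolerance would be needed to absorb small insertions and deletions.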

Similarly, with a proper choice of key string, KSA fragments a given tandem
repeat into monomers, cuts Alu sequences at two identical positions (providing
identification of Alu sequences), cuts a palindrome (providing identification of
large palindrome sequences and their substructure), and so on. KSA provides a
straightforward ordering of KSA fragments, regardless of their size (from small
fragments of a few bp to as large as tens of kilobasepairs). KSA provides a high
degree of robustness and requires only a modest scope of computations using a
PC. Due to its robustness, KSA is effective even in cases of significant
deletions, insertions, and substitutions, providing detailed HOR annotation and
structure, consensus sequence, and exact consensus length in a given genomic
sequence even if it is highly distorted, intertwined and riddled (segmentally
fuzzy repeats). Using a HOR consensus sequence, in the next step KSA computes
finer characteristics, for example the SF classification and CENP-B box/pJα
distributions.



B. Global Repeat Map 

The GRM program is an extension of the KSA framework. GRM of a given genomic
sequence is executed in five steps.

Step 1 GRM-Total module: Computes the frequency versus fragment length
  distribution for a given genomic sequence by superposing results of
  consecutive KSA segmentations computed for an ensemble of all 8 bp key strings
  (4^8 = 65536 key strings). In the GRM diagram, each pronounced peak
  corresponds to one or more repeats at that length, tandem or dispersed. GRM
  computation is fast and can be easily executed for a human chromosome using a
  PC.

Step 2 GRM-Dom module: Determines the dominant key string corresponding to the
  fragment length for each peak in the GRM diagram from step 1. A particular 8
  bp key string (or a group of 8 bp key strings) that gives the largest
  frequency for a fragment length under consideration is referred to as the
  dominant key string.

Step 3 GRM-Seg module: Performs segmentation of a given genomic sequence into
  KSA fragments using the dominant key string from step 2. Any periodic segment
  within the KSA length array reveals the location of a repeat and provides
  genomic sequences of the corresponding repeat copies.

Step 4 GRM-Cons module: Constructs the consensus sequence by aligning all
  sequences of repeat copies from step 3.

Step 5 NW module: Computes the divergence between each repeat copy from step 3
  and the consensus sequence from step 4 using the Needleman-Wunsch algorithm
  (Needleman and Wunsch 1970).
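Steps 1 to 3 above can be sketched in a few lines. The sketch is only an illustration under stated assumptions, not the GRM program itself: the function names are hypothetical, the test sequence is a made-up 20 bp unit repeated six times between two short flanks, and the superposition of all key-string segmentations is obtained with a shortcut (for every 8 bp window, the distance to the previous occurrence of the same window is exactly one KSA fragment length for that key string).

```python
from collections import Counter, defaultdict


def grm_total(sequence: str, k: int = 8):
    """Step 1 (sketch): pooled fragment-length frequencies over all k-bp keys.

    Returns (freq, keys_by_length), where freq[length] is the number of KSA
    fragments of that length summed over all key strings, and
    keys_by_length[length] counts the contribution of each key string.
    """
    freq = Counter()
    keys_by_length = defaultdict(Counter)
    last_seen = {}
    for i in range(len(sequence) - k + 1):
        key = sequence[i:i + k]
        if key in last_seen:
            length = i - last_seen[key]
            freq[length] += 1
            keys_by_length[length][key] += 1
        last_seen[key] = i
    return freq, keys_by_length


def dominant_key_string(keys_by_length, length: int) -> str:
    """Step 2 (sketch): key string contributing most fragments of `length`."""
    return keys_by_length[length].most_common(1)[0][0]


def segment(sequence: str, key: str) -> list[str]:
    """Step 3 (sketch): fragments between successive occurrences of `key`."""
    cuts, pos = [], sequence.find(key)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(key, pos + 1)
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical test sequence: a 20 bp unit repeated six times between flanks.
unit = "ACGTTGCAAGCTTAGGCTAA"
sequence = "GGATCCGTAC" + unit * 6 + "CATGCCTAGG"

freq, keys_by_length = grm_total(sequence)
peak_length = freq.most_common(1)[0][0]
key = dominant_key_string(keys_by_length, peak_length)
copies = segment(sequence, key)
print(peak_length, key, len(copies))
```

The single pass over the sequence keeps the cost linear in sequence length, which is consistent with the statement that the computation is feasible on a PC, although the actual program may be organized quite differently.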

Regarding the 8 bp choice of key string size: using an ensemble of r-bp key
strings, the average length of KSA fragments is ~4^r. With increasing length of
key strings, the overall frequency of large fragment lengths increases. We
tested that the 8 bp key string ensemble is suitable for identification of
repeat units in a wide range of lengths, from ~10 bp to as much as ~100 kb.
However, from the GRM construction it follows that fully reliable results are
obtained for key string lengths not exceeding the repeat length under study.
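The ~4^r scaling of the average fragment length can be checked numerically. The following sketch assumes a uniformly random sequence (real chromosomes are not uniform, so this is only a sanity check of the scaling); the function name, sequence length, and random seed are arbitrary choices for illustration.

```python
import random


def mean_fragment_length(sequence: str, r: int) -> float:
    """Mean KSA fragment length pooled over all r-bp key strings.

    For one key string, the fragment lengths sum to (last cut - first cut),
    and the number of fragments is (number of cuts - 1).
    """
    first, last, count = {}, {}, {}
    for i in range(len(sequence) - r + 1):
        key = sequence[i:i + r]
        first.setdefault(key, i)
        last[key] = i
        count[key] = count.get(key, 0) + 1
    total_length = sum(last[k] - first[k] for k in first)
    n_fragments = sum(c - 1 for c in count.values())
    return total_length / n_fragments


random.seed(0)
sequence = "".join(random.choice("ACGT") for _ in range(200_000))
for r in (2, 3, 4):
    # Columns: key string length, 4**r, measured mean fragment length.
    print(r, 4 ** r, round(mean_fragment_length(sequence, r), 1))
```

With 8 bp key strings the same argument gives a mean fragment length of about 4^8 = 65536 bp, so long fragments are well represented while short repeats remain detectable through the most frequent key strings.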

In summary, the characteristics of GRM are: 

- robustness of the method with respect to deviations 
from perfect repeats, i.e., substitutions, insertions, and 
deletions; 

- use of ensemble of all 8 bp key strings as a starting 
point of algorithm, thus avoiding the need to choose a 
particular key string for any repeat structure; 

- straightforward identification of repeats (tandem and 
dispersed), applicable to very large repeat units, as 
large as tens of kilobasepairs; 

- easy identification of HORs and determination of con- 
sensus lengths and consensus sequences. 
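Steps 4 and 5 of GRM (consensus construction and Needleman-Wunsch divergence) can be sketched as follows. Several choices here are assumptions, since the text does not specify them: the scoring scheme (match +1, mismatch -1, gap -1), the definition of divergence as the percentage of alignment columns that are mismatches or gaps, and a column-wise majority consensus that requires repeat copies of equal length. The toy repeat copies are invented.

```python
from collections import Counter


def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def divergence_percent(a: str, b: str) -> float:
    """Percentage of alignment columns that are mismatches or gaps."""
    x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    return 100.0 * sum(1 for p, q in zip(x, y) if p != q) / len(x)


def majority_consensus(copies: list[str]) -> str:
    """Column-wise majority base; assumes copies of equal length."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


# Hypothetical repeat copies (10 bp each, one with a single substitution).
copies = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
consensus = majority_consensus(copies)
divergences = [divergence_percent(c, consensus) for c in copies]
print(consensus, divergences)
```

For copies of unequal length a multiple alignment would be needed before taking the majority base; the sketch deliberately leaves that out.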



III. RESULTS AND DISCUSSION 

Using the GRM algorithm, we have identified and analyzed tandem repeats, HORs,
and regularly dispersed repeats with large repeat units in human and chimpanzee
Y chromosomes (Build 37.1 and Build 2.1 assemblies, respectively). A summary of
all large repeat units identified and analyzed in this article and the
human-chimpanzee comparison are given in Tables I, II, and III.



A. Alphoid Higher Order Repeat Units in Human and 
Chimpanzee Y Chromosome 

1. Riddled HOR Scheme with 45 Distinct Alphoid Monomers in 
Human Y Chromosome 

The largest repeat array in the human Y chromosome assemblies studied here is
the major alphoid HOR array which, as will be shown here, strongly diverges from
the chimpanzee alphoid HOR. For this reason, we first present our results for
alphoid HORs. In the contig NT_087001.1 in the centromere of human chromosome Y
and in NT_011878.9 in the pericentromeric region on the proximal side of the p
arm (DYZ3 locus), we identify the peripheral segments of the major block of the
alphoid HOR array. In the spacing between these two contigs lies a large central
section of this HOR array. This spacing of ~3 Mb has not been sequenced so far
in the Build 37.1 assembly. The GRM results for the alphoid monomer structure of
the two peripheral HOR segments are shown in Fig. 1 and Supplementary Table 1.
In Fig. 1 we use a method of schematic presentation described by Rosandic et al.
(2006).
TABLE I Tandem repeats, HORs and dispersed repeats with large repeat units in contigs of human Y chromosome.

Repeat unit (bp)  Structure  Character            Contig          Chr Y start  Chr Y end  Length
                                                                  position     position
~171              PRU(a)     Alpha satellite      NT_011878.9     10083775     13131913   3048138
35mer (45mer(c))  SRU(a)     Alphoid HOR          NT_087001.1
125               PRU        Tandem               NT_011875.12    22216726     22513032   296306
~545              PRU(a)     Tandem               see Table V                             12577
~1641(c)          SRU(a)     Regularly dispersed
~23541(c)         TRU(a)     Third order tandem   NT_011903.12    24023693     24070760   47067
                                                                  24312159     24333896   21737
                                                                  24544818     24566560   21742
                                                  NT_011875.12    23654663     23713744   59081
~2385             PRU(a)     Tandem               NT_011903.12    25298078     25312458   14380
~4757(c)          SRU(a)     2mer HOR                             25376692     25424719   48027
~7155(c)          SRU(a)     3mer HOR                             26929417     26948531   19114
                                                                  27001927     27038009   36082
5                 PRU(a)     Tandem               NT_025975.2     58819393     58917657   98264
~3579             SRU(a)     715mer HOR
5                 PRU(a)     Tandem               NT_113819.1     13690637     13747836   57199
~5607(c)          SRU(a)     1123mer HOR
~5096(c)          PRU(b)     Dispersed            NT_011875.12    20121395     20126501   5106
                                                                  20003268     20008374   5106
                             Dispersed            NT_011903.12    26206614     26211701   5087
                                                                  27750731     27755818   5087
~10848            PRU(b)     Tandem               NT_011903.12    25312733     25341062   28329
                                                                  26984151     27001645   17494
~15766(c)         PRU(b)     Dispersed            NT_011875.12    23167813     23183579   15766
                                                                  23209651     23225434   15783
~15775(c)         PRU(b)     Tandem               NT_011896.9     6543373      6574923    31550
                  PRU(b)     Dispersed            NT_011651.17    14540408     14556183   15775
~20309            PRU(b)     Tandem               NT_011878.9     9293306      9374535    81229
                  PRU(b)     Tandem               NT_086998.1     9170808      9241328    70520
~60910(c)         PRU(b)     Dispersed            NT_011875.12    19697222     19759044   60917
                                                                  20420735     20482553   60909
~72140(c)         PRU(b)     Dispersed            NT_011875.12    19829682     19900381   70699
                                                                  20279397     20350098   70701

PRU: primary repeat unit; SRU: secondary repeat unit; TRU: tertiary repeat unit;
dispersed: dispersed at random spacings; regularly dispersed: dispersed at
regular spacings
(a) Described in text
(b) Described in Supplementary text
(c) Reported for the first time in this work



In each of these two segments we identify 45 distinct alphoid monomers, denoted
m01, ..., m45, arranged head-to-tail in the same orientation and mutually
diverging by ~20%. The consensus length of this 45mer HOR is 7662 bp. Here, an
alphoid monomer is assigned as a constituent of the HOR if it appears in at
least two HOR copies at a very low mutual divergence. Consensus sequences of
monomers forming the HOR are shown in Supplementary Table 2. In both contigs,
the consensus sequences of monomers constituting the HOR are equal, reflecting
the fact that they are two peripheral segments of the same HOR array (Table IV).
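The HOR criterion used here (monomer i nearly identical to monomer i + n, while monomers within one copy differ strongly) can be sketched as a simple test on an ordered list of monomers. The sketch is illustrative only: the 40 bp toy monomers are invented, identity is computed position by position without alignment (an assumption that only works for monomers of equal length), and the 95% threshold is taken from the HOR definition in the Introduction.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (no alignment; a simplifying assumption)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def hor_period(monomers: list[str], min_identity: float = 0.95):
    """Smallest n such that every monomer i matches monomer i + n at
    `min_identity` or better; returns None if there is no such n."""
    for n in range(1, len(monomers) // 2 + 1):
        if all(identity(monomers[i], monomers[i + n]) >= min_identity
               for i in range(len(monomers) - n)):
            return n
    return None


def substitute(monomer: str, pos: int) -> str:
    """Copy of `monomer` with a single base substitution at `pos`."""
    new_base = "A" if monomer[pos] != "A" else "C"
    return monomer[:pos] + new_base + monomer[pos + 1:]


# Hypothetical 40 bp "monomers" that differ strongly from each other.
m1 = "ACGTACGTAC" * 4
m2 = "TTGACCAGTA" * 4
m3 = "GGCATCGATT" * 4

# Three HOR copies; some monomers carry one substitution (97.5% identity).
array = [m1, m2, m3,
         substitute(m1, 5), m2, substitute(m3, 7),
         m1, substitute(m2, 0), m3]

print(hor_period(array))               # monomers per HOR unit in the toy array
print(round(identity(m1, m2), 2))      # identity between distinct monomer types
```

A riddled array with monomer deletions or insertions, as reported for the human Y chromosome, would break this strict test; handling it requires matching each monomer to a consensus monomer first, as in Fig. 1.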



Divergence between monomers in individual HOR copies and the corresponding
consensus monomers is very low (on average 0.3%). However, the HOR structure is
characterized by some pronounced monomer deletions and insertions, giving a
riddled pattern (Table IV) due to a variety of lengths of HOR copies (Fig. 1).
We find monomer deletions in seven HOR copies, monomer insertions in two, and
nonalphoid insertions of 0.2 to 0.3 kb in three HOR copies. (In some HOR copies
there are multiple insertions and/or deletions.)

TABLE II Tandem repeats, HORs and dispersed repeats with large repeat units in contigs of chimpanzee Y chromosome.

Repeat unit (bp)  Structure  Character            Contig          Chr Y start  Chr Y end  Length
                                                                  position     position
~171              PRU(a)     Alpha satellite      NW_001252921.1  7108946      7151404    42458
30mer(c)          SRU(a)     Alphoid HOR
~550              PRU(a)     Tandem               See Table IV                            30832
~1652(c)          SRU(a)     Regularly dispersed
~23578(c)         TRU(a)     Third order tandem   NW_001252921.1  7707476      7728531    21055
                                                                  8130226      8160370    30144
                                                                  8433315      8464030    30715
                                                                  8866559      8897264    30705
                                                                  9166900      9197050    30150
                                                                  9598779      9628923    30144
~2383             PRU(a)     Tandem               NW_001252917.1  3256815      3278585    21770
                                                                  3406716      3428486    21770
                             Tandem               NW_001252922.1  11224963     11256302   31339
                                                                  11298117     11327074   28957
~5096             PRU(b)     Tandem               NW_001252916.1  1956128      1981606    25478
                                                                  2082379      2092569    10190
                             Dispersed            NW_001252920.1  5633270      5638363    5093
                             Dispersed            NW_001252924.1  12174453     12179546   5093
                                                                  12280382     12285475   5093
~10762(c)         PRU(b)     Tandem               NW_001252919.1  276373       308349     31976
                             Tandem               NW_001252921.1  2823896      2845204    21308
                             Dispersed            NW_001252925.1  1219588      1230035    10447
~10853(c)         PRU(a)     Tandem               NW_001252917.1  1130756      1160123    29367
~64624(c)         SRU(a)                                          1174942      1224747    49805
~60523(c)         PRU(b)     Tandem               NW_001252918.1  3827479      3948523    121044
                  PRU(b)     Dispersed            NW_001252922.1  10310933     10371414   60481
                  PRU(b)     Dispersed            NW_001252919.1  5301324      5361771    60447
~71778(c)         PRU        Dispersed            NW_001252925.1  12394038     12465796   71758
                  PRU        Dispersed            NW_001252915.1  1775843      1847647    71804
                  PRU        Dispersed            NW_001252917.1  2201698      2273485    71787
                  PRU        Dispersed            NW_001252919.1  5440228      5505887    65659
~72140(c)         PRU        Tandem               NW_001252923.1  11947703     12091619   143916

For description see Table I

Two out of ten HOR copies contain the 10-alphoid-monomer subsequence m24, ...,
m33 (Fig. 1). These ten monomers are positioned between the monomers m23 and
m34. The distance between the two highly identical 10-alphoid-monomer
subsequences is ~3 Mb.

The other 35 of the 45 distinct alphoid monomers in the peripheral region of the
major alphoid HOR form a subsequence consisting of two segments, additionally
riddled at some positions. Each of these 35 alphoid monomers appears in three or
more HOR copies (Fig. 1). If we delete the 10-alphoid-monomer subsequence from
the 45mer, we obtain a 5957 bp 35mer, which is similar to the secondary
periodicity sequence of 5941 bp reported by Skaletsky et al. (2003).



Discussing the relationship of the initially reported 5.7 and 6.0 kb repeat
units, Tyler-Smith and Brown proposed that one HOR unit is derived from the
other, although more complex explanations, with both units derived from a third
unknown HOR unit, were considered possible (Tyler-Smith and Brown 1987). It was
considered very unlikely that the 6.0 kb unit arose from the 5.7 kb unit by
addition of two alphoid monomers, because the results excluded the possibility
that the two additional alphoid monomers in the 6.0 kb unit are duplications of
any monomers contained in the 5.7 kb unit (Tyler-Smith and Brown 1987).
Therefore, the favored hypothesis was that the shorter, 5.7 kb HOR unit arose
from the longer 6.0 kb HOR unit by deletion of two alpha monomers. Extending
similar considerations to the present case, the 35mer in the internal centromere
region could be considered as arising from the 45mer by deletion of ten alphoid
monomers, which are all distinct from the monomers in the 35mer. This is
consistent with a general view (Warburton and Willard 1996) that a type of
polymorphism found in alphoid arrays can be related to HOR units that differ by
an integral number of alphoid monomers.

FIG. 1 Schematic presentation of the aligned monomer structure of the 45mer
alphoid HOR (consensus length 7662 bp) in human chromosome Y (Build 37.1). This
method of schematic presentation of HOR sequences is self-evident if one
compares Fig. 1 and Supplementary Table 1. Top: enumeration of columns
corresponding to the 45 constituent consensus monomers (enumerated Nos. 1 to 45)
in the consensus HOR. (For simplicity, only every fifth number is shown.) Each
monomer of a HOR copy is presented by a bar in the corresponding column
enumerated at the top. Monomers from different HOR copies corresponding to the
same monomer from the consensus HOR are presented by bars in the same column.
For example, in the first HOR copy the first monomer corresponds to monomer No.
6 in the consensus HOR and is presented by a bar at the position of the 6th
column (denoted by 6i), the second monomer in the first HOR copy corresponds to
monomer No. 7 in the consensus HOR and is presented by a bar at the position of
the 7th column, ..., the fourth monomer in the first HOR copy corresponds to
monomer No. 15 in the consensus HOR and is presented by a bar at the position of
the 15th column, ..., and the last monomer in the first HOR copy (the 23rd)
corresponds to monomer No. 45 in the consensus HOR and is presented by a bar at
the position of the 45th column. Upper panel: HOR copies in contig NT_011878.9.
Lower panel: HOR copies in contig NT_087001.1. Middle panel: the 5941 bp
secondary periodicity sequence from Skaletsky et al. (2003) mapped into alphoid
monomers {m}. For the mapping of {w}-monomers from Skaletsky et al. (2003) into
{m}-monomers, see the text and Supplementary Tables 2-4. Open circle: pJα motif
(essential part) in alpha monomers. The m05 monomer from the last incomplete HOR
copy (Ss) in contig NT_087001.1 is followed by an alpha satellite monomeric
region (not shown here). a: after m08, 210 bp insertion (no similarity to HOR
monomers); b: after the m13-m16 duplication (inserted after m17) there are two
insertions, a 170 bp insertion (differing in 19 bases from m24 and m34 as the
closest monomers from the HOR) and a 168 bp insertion (differing in 20 bases
from m28 as the closest monomer from the HOR); c: after m40, 278 bp insertion
(no similarity to HOR monomers); d: after the first 34 bases of m15, end of the
contig NT_011878.9; e: the last 166 bases of m14, start of the contig
NT_087001.1; f: after m17, 311 bp insertion (no similarity to HOR monomers); g:
after m36, 171 bp insertion (differing in 13 bases from m23 as the closest
monomer from the HOR); h, i: two deletions in w20; j: 53 bp nonalphoid insertion
in w29.

Divergence pattern provides an additional evidence 
that ten additional alphoid monomers m24, . . . , m33 are 
constituents of major HOR. Mutual divergence between 
these ten monomers is similar to their mean divergence 
with respect to the other 35 monomers (Table Iv]). 



2. Suprachromosomal Family Assignment of Monomers in 
45mer HOR 



Studies of sequence comparison of alpha satellite 
monomers in human chromosomes revealed 12 types 
of monomers, forming five suprachromosomal fami- 
lies (SFs), which descend from two basic subsets of 
monomers, A and B: to the subset A belong the SF types 
Jl, D2, W4, W5, Ml, and Rl, and to the subset B belong 



J2, Dl, Wl, W2, W3, and R2 (Alexandrov et al. 2001 



Romanova et al. 1996 Warburton and Willard 19961. 



We determine the SF assignments of monomers consti- 
tuting alphoid HOR by pairwise comparison between ev- 
ery monomer from HOR to every of 12 SF consensus 
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TABLE III Correspondence of large repeat and HOR units in 
Y chromosome contigs of human and chimpanzee.

Human                                 | Chimpanzee
125 bp PRU                            |
~171 bp PRU                           | ~171 bp PRU
35mer/45mer SRU                       | 30mer SRU
~545 bp PRU                           | ~550 bp PRU
~1641 bp SRU                          | ~1652 bp SRU
~23541 bp TRU                         | ~23578 bp TRU
~2385 bp PRU                          | ~2383 bp PRU
~4757 bp SRU                          | -
~7155 bp SRU                          |
5 bp PRU                              | 5 bp PRU
~3579 bp SRU                          | -
~5096 bp PRU (dispersed)              | ~5096 bp PRU
5 bp PRU                              | 5 bp PRU
5607 bp SRU                           |
~10.8 kb PRU (within ~20309 bp PRU)   | ~10762 bp PRU
~10848 bp PRU                         | ~10853 bp PRU
                                      | ~64624 bp SRU
~15766 bp PRU (dispersed)             | Dispersed fragments
~15775 bp PRU                         |
~60910 bp PRU (dispersed)             | ~60523 bp PRU
~72140 bp PRU (dispersed)             | ~72140 bp PRU
                                      | ~71778 bp PRU (dispersed)

PRU primary repeat unit, SRU secondary repeat unit, TRU tertiary repeat unit

TABLE IV Riddled pattern with variety of number of monomers in human alphoid HOR copies (Build 37.1 assembly).

HOR copy no. | No. of monomers (counting distinct monomers) | No. of monomers (counting all monomers)
1            | 23                                           | 23
2            | 45                                           | 51
3            | 31                                           | 31
4            | 14                                           | 14
5a           | 6                                            | 6
6            | 35                                           | 36
7            | 35                                           | 35
8            | 35                                           | 35
9            | 45                                           | 45
10a          | 5                                            | 5

a Truncated at the start or end of the contig. Copies No. 1-4 are from contig NT_011878.9. Copies No. 5-10 are from contig NT_087001.1

TABLE V Average divergence between two subsets of alphoid monomers from 45mer HOR copies.

Monomer comparison | Divergence (%)
10 vs. 10          | ~19
10 vs. 35          | ~20
35 vs. 35          | ~21

10 denotes the subset of ten new monomers m24, ..., m33
35 denotes the subset of 35 monomers m01, ..., m23 and m34, ..., m45



monomers from Romanova et al. (1996). A 45 × 12 divergence matrix is constructed between the 45 HOR monomers and these 12 SF consensus monomers. To each monomer from the HOR we assign the SF classification of the most similar SF consensus monomer. In this way we find that, out of the forty-five monomers from the HOR, forty are of M1 type (in most cases the second lowest divergence corresponds to R2, and in three cases the M1 and R2 divergences are equal), and five are of R2 type (in these cases the second lowest divergence corresponds to the M1 type).
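
The assignment rule described above (each HOR monomer receives the SF type of its least divergent consensus monomer) can be illustrated with a minimal sketch. It assumes the sequences have already been aligned to equal length and uses a plain mismatch percentage as the divergence; the function names and the toy sequences are hypothetical and are not taken from the paper or from Romanova et al. (1996).

```python
def divergence(a: str, b: str) -> float:
    """Percent of mismatching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return 100.0 * mismatches / len(a)


def assign_sf(monomers: dict, sf_consensus: dict) -> dict:
    """For each monomer return (closest SF type, second closest SF type)."""
    assignment = {}
    for name, seq in monomers.items():
        # One row of the (monomers x SF consensus) divergence matrix.
        row = {sf: divergence(seq, cons) for sf, cons in sf_consensus.items()}
        ranked = sorted(row, key=row.get)
        assignment[name] = (ranked[0], ranked[1])
    return assignment


# Toy data only (hypothetical 10-base "consensus" sequences).
sf_consensus = {"M1": "ACGTACGTAC", "R2": "ACGTTCGTTC", "J1": "TTTTTTTTTT"}
monomers = {"m01": "ACGTACGTAA", "m02": "ACGTTCGTTA"}
result = assign_sf(monomers, sf_consensus)
print(result)
```

Returning the runner-up type alongside the best match mirrors the way the text reports the second lowest divergence for each monomer.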

The differences between the A and B subsets are, in general, concentrated in a small region that matches the functional protein binding sites for pJα in subset A and for CENP-B in subset B (Romanova et al. 1996). Analyses of the human genome have indicated that a CENP-B box appears in the subset B monomers (in about 60% of B-type monomers) and is absent in the subset A monomers, while the pJα motif occurs only in some of the monomers from subset A and not in the subset B monomers (Romanova et al. 1996).

After determining the SF classification of monomers in the consensus HOR, we investigate the appearance of the CENP-B box and the pJα motif in these monomers. We find that the pJα motif (essential part) is present in 55% of the ten new alphoid monomers and, similarly, in 57% of the other 35 monomers, while the CENP-B box is completely absent (Fig. 1). The consensus HOR has a robust pJα distribution, containing 25 pJα motif copies. All alphoid monomers in the consensus HOR are significantly more similar to the pJα motif than to the CENP-B box: the mean deviation is 0.6 bp for the pJα motif and 4.7 bp for the CENP-B box, indicating that the absence of the pJα motif in some monomers of the 45mer HOR can be attributed mostly to a single nucleotide mutation within an initially intact pJα motif.
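
The "mean deviation" quoted above can be read as the average, over monomers, of the smallest number of mismatches between a motif and any window of the monomer. The following sketch makes that reading concrete under stated assumptions: N in the motif is treated as a wildcard, and the 7-base motif and the two short sequences are hypothetical toy data, not the actual pJα or CENP-B motifs (which should be taken from Romanova et al. 1996).

```python
def motif_deviation(monomer: str, motif: str) -> int:
    """Smallest number of mismatches between the motif and any window.

    Positions where the motif has 'N' are wildcards and never count.
    """
    k = len(motif)
    if len(monomer) < k:
        raise ValueError("monomer is shorter than the motif")
    best = k
    for start in range(len(monomer) - k + 1):
        window = monomer[start:start + k]
        mismatches = sum(
            1 for m, w in zip(motif, window) if m != "N" and m != w
        )
        best = min(best, mismatches)
    return best


def mean_deviation(monomers: list, motif: str) -> float:
    """Average best-window deviation over a set of monomers."""
    return sum(motif_deviation(s, motif) for s in monomers) / len(monomers)


# Hypothetical toy motif and sequences, for illustration only.
toy_motif = "TTCGNNA"
toy_monomers = ["AATTCGGGACC", "AATTAGGGACC"]
print(mean_deviation(toy_monomers, toy_motif))
```

A mean deviation well below 1 bp then corresponds to most monomers carrying the motif exactly and the rest differing from it by a single substitution.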

Since the pJα motif is essential for protein binding, an interesting question is whether the monomers with and without the pJα motif have different sequence divergences. We find that the pairwise divergence among the 45 monomers shows no dependence on the presence or absence of the pJα motif.

It should be noted that the HOR copies in chromosome Y are the only reported case where the pJα motif is present and the CENP-B box is absent.

In this connection, we note the unique case of a 13mer HOR (2214 bp consensus length) in chromosome 5, which contains neither the CENP-B box nor the pJα motif (Rosandić et al. 2006).
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3. Alignment of Peripheral and Internal Human HOR Copies 

Let us now compare our consensus HOR for the peripheral parts of the major alphoid HOR block (DYZ3 locus) (Supplementary Table 2) with the 5941 bp secondary periodicity sequence in its internal part reported by Skaletsky et al. (2003), which corresponds to the sequence gap between the contigs NT_011878.9 and NT_087001.1 in the Build 37.1 assembly.

First, we fragment the 5941 bp sequence from Skaletsky et al. (2003) into 35 constituent alpha monomers, denoted w01, ..., w35 (Supplementary Table 3). We find a peculiar feature of this secondary periodicity sequence: two of its constituent monomers, w20 and w29, deviate sizeably in length from the alpha satellite consensus length of 171 bp. The alphoid monomer w20 has a length of 104 bp (i.e., 67 nucleotides are deleted with respect to the consensus alpha monomer length), while the monomer w29 is 224 bp long, containing a 53 bp nonalphoid insertion with respect to the consensus alpha monomer.

To align the internal monomer sequence {w} (Supplementary Table 3) to the peripheral monomer sequence {m} (Supplementary Table 2), we shift the start position of the alpha monomers m01, m02, ..., m45, obtaining the sequence denoted by n01, n02, ..., n45 (Table VI). The 35 alphoid monomers from the sequence {w} are aligned to 35 out of the 45 monomers {n} (Table VI and Supplementary Table 4). The monomers n26, ..., n35 have no counterpart in the {w} sequence, which corresponds to the internal part of the major alphoid HOR from Skaletsky et al. (2003).



TABLE VI Transformation between monomer sets {m} and {n} and alignment between alphoid monomer sets {w} and {n}

Transformation
n01(169) = m44(...113) + m45(056...)
n02(166) = m45(...110) + m01(056...)
...
n45(170) = m43(...114) + m44(056...)

Alignment
w01 = n01
...
w25 = n25
w26 = n36
...
w35 = n45

For definition of monomers {n} and {w} see Supplementary Tables 3 and 4. In the transformation from {m} to {n} the notation m44(...113) denotes the last 113 bases in m44, m45(056...) denotes the first 56 bases in m45, and so on (Supplementary Table 4). In the alignment between {n} and {w} the 35 alphoid monomers from the sequence {w} are aligned to 35 out of 45 monomers from the sequence {n}. Here, the only significant differences appear between w20 and n20 (due to the presence of a deletion in w20), and between w29 and n39 (due to the presence of an insertion in w29). The monomers n26, ..., n35 have no counterpart in the set {w}, which corresponds to the internal part of the major alphoid HOR
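
The re-cutting and index mapping summarized in Table VI can be sketched as follows. The sketch assumes the 45 consensus monomers are available as strings in HOR order and that every n monomer consists of the tail of one m monomer (from base 57 onward) followed by the first 56 bases of the next one, as the three listed rows suggest; the function names and the toy monomers are hypothetical.

```python
def shift_monomers(m: list, offset: int = 56, first: int = 44) -> list:
    """Re-cut a circular list of monomers with shifted start positions.

    n_k is the tail of one monomer (from `offset` on) plus the first
    `offset` bases of the following monomer. `first` is the 1-based
    index of the monomer whose tail opens n01 (m44 in Table VI).
    """
    count = len(m)
    shifted = []
    for k in range(count):
        a = (first - 1 + k) % count   # monomer contributing its tail
        b = (a + 1) % count           # monomer contributing its head
        shifted.append(m[a][offset:] + m[b][:offset])
    return shifted


def w_to_n(i: int) -> int:
    """Index of the n monomer aligned to w_i; n26..n35 have no partner."""
    if not 1 <= i <= 35:
        raise ValueError("w index must be between 1 and 35")
    return i if i <= 25 else i + 10


# Toy monomers: 56 filler bases followed by a tag naming the monomer.
toy_m = ["A" * 56 + f"<{j:02d}>" for j in range(1, 46)]
toy_n = shift_monomers(toy_m)
print(toy_n[0][:4], toy_n[1][:4], toy_n[44][:4])
print(w_to_n(20), w_to_n(29))
```

With this mapping the two irregular internal monomers w20 and w29 land on n20 and n39, in line with the note to Table VI.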



4. Global Repeat Map for Riddled Alphoid HOR and 
Characteristic HOR-Signature in Human Chromosome Y 

FIG. 2 GRM diagram for the Build 37.1 genomic assembly of human chromosome Y for the following intervals of fragment lengths: a 0-1500 bp. There are two pronounced tandem arrays with repeat units below 1.5 kb: the alphoid tandem repeat with the alpha satellite repeat unit of 171 bp and the overlapping tandem repeat with a repeat unit of 125 bp. The peaks at multiples of the alphoid monomer repeat unit 171 bp, n·171 bp, are denoted by nα. The peaks at multiples of the 125 bp repeat unit, n·125 bp, are denoted by nS. b 0-80000 bp. Pronounced peaks above 2 kb are denoted by the corresponding fragment lengths. The most pronounced peaks are approximately at 2385, 10848, 15775, 20309, 23541, and 41584 bp. Arrow i: peak corresponding to 715mer. Arrow j: peak corresponding to 1123mer. For description of peaks see the text

To investigate more closely the major alphoid HOR array in human chromosome Y, we compute the GRM diagram for the genomic sequence of the Y chromosome (Fig. 2). The most pronounced peaks in this diagram correspond to the following tandem repeats in chromosome Y: the alphoid repeats (GRM peaks at multiples of the ~171 bp repeat unit), the 125 bp repeats (GRM peaks at multiples of the 125 bp repeat unit), GRM peaks at multiples of the 5 bp repeat unit, and GRM peaks corresponding to the ~20.3 kb repeat unit. In addition, there are nine pronounced GRM peaks at repeat lengths above 2000 bp.
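
To make the notion of a peak in a GRM diagram more tangible, the following is a deliberately simplified sketch and not the GRM implementation used in this work. It assumes that the diagram can be approximated by a histogram of distances between consecutive occurrences of every k-mer key string in the sequence; the key-string length of 8, the function name, and the 24 bp toy repeat unit are illustrative assumptions.

```python
from collections import Counter, defaultdict


def fragment_length_histogram(seq: str, k: int = 8) -> Counter:
    """Histogram of distances between consecutive occurrences of each k-mer.

    Cutting the sequence at every occurrence of a key string yields
    fragments whose lengths equal these distances; summing over all key
    strings gives a global fragment-length spectrum.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    hist = Counter()
    for pos in positions.values():
        for left, right in zip(pos, pos[1:]):
            hist[right - left] += 1
    return hist


# Toy tandem array: ten exact copies of a 24 bp unit.
unit = "ACGTTGCAAGGCTTAACCGGATCG"
hist = fragment_length_histogram(unit * 10, k=8)
peak_length, peak_count = hist.most_common(1)[0]
print(peak_length, peak_count)
```

For an exact tandem array the spectrum collapses onto the repeat unit length; mutations, insertions, and deletions spread the signal over additional fragment lengths, which is the effect exploited in the HOR-signature analysis of this section.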

Here, we perform a detailed study of the alphoid HOR repeat sequence. Analyzing the partial contributions of individual contigs to the GRM diagram of chromosome Y, we find that the largest frequency contributions to the alphoid HOR peaks arise from the contigs NT_011878.9 and NT_087001.1. The relevant intervals of fragment lengths for these two contigs are shown in Fig. 3a and b, respectively. In both figures, the peaks at approximate multiples of the basic repeat length ~171 bp decrease with increasing multiple order, which is a natural trend for tandem repeats. However, we do not find a peak corresponding to the HOR length, which for regular HORs in other chromosomes appears at their consensus length. This is because the Build 37.1 assembly of chromosome Y encompasses only the peripheral tails of the major HOR array, and those exhibit sizeable riddling in both relevant contigs, as shown in the monomer structure of the peripheral HOR copies in Fig. 1. For these riddled HOR copies there is no dominating consensus length, and therefore no peak corresponding to a consensus length is present. Instead, the GRM diagram shows more intricate HOR-related peaks, which characterize the riddled alphoid HOR copies. These peaks will be referred to as the GRM HOR-signature. The most pronounced GRM HOR-signature peaks of the riddled HOR pattern in the peripheral regions of the major alphoid HOR in chromosome Y are at the lengths shown in Fig. 3a, b. These characteristic fragment lengths are fully consistent with the riddled HOR structure from Fig. 1.

FIG. 3 GRM diagrams for sequences in contigs containing alphoid HOR in chromosome Y: a NT_011878.9, b NT_087001.1, and c secondary periodicity sequence for the internal part of the major alphoid HOR block (genomic sequence from Skaletsky et al. (2003))

As an example, let us consider the largest GRM HOR-signature peak, at 5551 bp, which characterizes the HOR pattern in NT_011878.9. This peak arises from the approximate repeat of the 1_3-14_3 subsequence (monomers 1 to 14 of HOR copy No. 3) at the position of the 1_4-14_4 subsequence (monomers 1 to 14 of HOR copy No. 4). The distance between the corresponding bases in these two subsequences (Table VII) is equal to the distance between monomers 1_3 and 1_4 (Fig. 1 and Supplementary Table 1).



TABLE VII Contributions to the fragment length 5551 bp alphoid GRM HOR-signature peak for human Y chromosome

Length (bp) | Distance
2896        | 1_3 - 17_3
848         | 19_3 - 23_3
1194        | 34_3 - 40_3
278         | Nonalphoid insertion
335         | 44_3 - 45_3
Σ 5551      |
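
The arithmetic behind Table VII can be checked with a few lines. The sketch reads the table labels as monomer ranges within HOR copy No. 3 (an interpretation of the table notation, stated here as an assumption) and simply adds up the listed contributions; the variable names are illustrative.

```python
# (label, length in bp, number of alphoid monomers in the segment)
segments = [
    ("monomers 1-17 of copy 3", 2896, 17),
    ("monomers 19-23 of copy 3", 848, 5),
    ("monomers 34-40 of copy 3", 1194, 7),
    ("nonalphoid insertion", 278, 0),
    ("monomers 44-45 of copy 3", 335, 2),
]

# Distance between the starts of HOR copies 3 and 4.
peak_length = sum(length for _, length, _ in segments)

# Alphoid part only: number of monomers and their mean length.
n_monomers = sum(count for _, _, count in segments)
alphoid_bp = sum(length for _, length, count in segments if count > 0)
mean_monomer_length = alphoid_bp / n_monomers

print(peak_length, n_monomers, round(mean_monomer_length, 1))
```

The 31 monomers obtained this way agree with the entry for HOR copy No. 3 in Table IV, and their mean length is close to the ~171 bp alphoid monomer length.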





Therefore, the GRM diagram shows a pronounced peak at the 5551 bp fragment length, reflecting the riddled structure of the HORs. Similarly, we interpret all the other HOR-signature peaks that characterize the riddling in the HOR copies from Fig. 1.

FIG. 4 Schematic presentation of aligned monomer structure of 30mer alphoid HOR (consensus length 5066 bp) in chimpanzee chromosome Y (Build 2.1, contig NW_001252921.1). Top row: enumeration of 30 constituent alpha monomers from consensus HOR. Upper panel: HOR copies in interval 264-20019. Lower panel: reverse complement of HOR copies in interval 20618-42459. After monomer No. 20 (label a): 41 bp insertion (no similarity to monomers in 30mer). For comparison with human alphoid HOR see Fig. 1. Open circle: pJα motif (essential part) in alpha monomers

In addition to the GRM computation for the Build 37.1 sequence of chromosome Y, let us comment on the GRM HOR-signature related to the irregularity (monomers w20 and w29) in the interior region of the major alphoid HOR array in chromosome Y (Supplementary Tables 3, 4). Figure 3c displays the GRM diagram computed for the 5941 bp secondary periodicity sequence from Skaletsky et al. (2003). Here again, we see the main pattern of monomer multiples ~171, ~2 × 171, ~3 × 171 bp, ... with decreasing frequencies for increasing multiples. In addition, we obtain two weak subsequences of peaks, at fragment lengths ~104 bp, ~(104 + 171) bp, ~(104 + 2 × 171) bp, ... and at ~224 bp, ~(224 + 171) bp, ~(224 + 2 × 171) bp, ... These two additional weak subsequences are due to the two distorted monomers in the 35mer periodicity (HOR) sequence that we deduced from the HOR genomic sequence in Skaletsky et al. (2003): the alphoid monomer w20 has a length of 104 bp (i.e., 67 nucleotides are deleted with respect to the consensus monomer), while the monomer w29 has a length of 224 bp, containing a 53 bp nonalphoid insertion with respect to the consensus monomer. Such deletions/insertions in two distant alphoid monomers within the HOR are absent in the peripheral regions of the major HOR array in chromosome Y, i.e., they are absent in the Build 37.1 assembly. Therefore, the GRM diagrams of these regions (Fig. 3a, b) do not have these two additional weak subsequences of peaks. This motivates a future extension of the Build assembly to the region of the sequence gap of ~3 Mb between the contigs NT_011878.9 and NT_087001.1.
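
The two weak peak families follow directly from the irregular monomer lengths: a monomer of length L embedded among regular ~171 bp monomers contributes fragment lengths L, L + 171, L + 2 × 171, and so on. A minimal sketch of this arithmetic, with an illustrative function name and the nominal 171 bp unit taken as exact, is:

```python
ALPHA_UNIT = 171  # nominal alpha satellite monomer length (bp)


def peak_family(base: int, unit: int = ALPHA_UNIT, count: int = 4) -> list:
    """First `count` expected fragment lengths: base, base + unit, ..."""
    return [base + n * unit for n in range(count)]


w20_length = ALPHA_UNIT - 67   # 67 bp deletion
w29_length = ALPHA_UNIT + 53   # 53 bp nonalphoid insertion

print(peak_family(w20_length))   # family seeded by w20
print(peak_family(w29_length))   # family seeded by w29
```

The observed peaks are only approximately at these values, since real monomers deviate slightly from 171 bp.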



5. Riddled 30mer HOR Scheme in Chimpanzee Chromosome Y

Applying GRM to the chimpanzee chromosome Y, we find two 30mer HOR arrays in the chimpanzee contig NW_001252921.1 (NCBI Build 2.1), positioned one after another (with a gap of 599 bp in between) at the front part of the contig. The first HOR array, truncated at the start of the contig, is referred to as direct. In fact, it seems to be a truncated tail of a major HOR block positioned in the unsequenced domain in front of the contig NW_001252921.1. We find that the reverse complement of the second HOR array is highly identical to the first HOR array, and therefore this second HOR array is referred to as reverse complement. This indicates that the direct and reverse complement HOR arrays are positioned on the opposite arms of a palindrome.
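
The test implied above (the second array matches the first one after reverse complementation) can be sketched as follows. The helper names are hypothetical, the identity measure is a simple position-by-position comparison of equal-length strings, and the two short "arms" are toy data rather than the actual chimpanzee sequences.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


def identity(a: str, b: str) -> float:
    """Percent of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Toy palindrome arms: the second arm is the reverse complement of the first.
direct_arm = "AACGTTGCA"
second_arm = reverse_complement(direct_arm)
print(identity(direct_arm, reverse_complement(second_arm)))
```

For real arrays containing insertions and deletions, an alignment step would be needed before computing identity; the sketch omits it.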

Our results for the detailed monomer scheme of these two peripheral HOR arrays, which are reverse complements of each other, are shown in Fig. 4 and Supplementary Table 5. The consensus length of the 30mer HOR unit is 5066 bp (consensus sequence in Supplementary Table 6).

In the GRM diagram of the whole chimpanzee Y chromosome (Fig. 5), the peak at 5066 bp fragment length is much weaker than the nearby 5096 bp peak of another repeat structure (see Tables II and III) and is therefore overshadowed. For this reason, we compute the GRM diagram selectively for the alphoid HOR-containing section of the genomic sequence at the start of contig NW_001252921.1 (positions 1-20019) (Fig. 6). In Fig. 6a, in the length interval between 0.1 and 1 kb, there are pronounced peaks approximately at multiples of the alphoid monomer repeat unit 171 bp, in analogy to Fig. 5a for the whole chimpanzee chromosome Y. Furthermore, the HOR-signature peaks are clearly seen in Fig. 6b as pronounced peaks at 5066 bp (~30 × 171 bp, denoted as 30α), 4895 bp (~29 × 171 bp, denoted as 29α), 3884 bp (~23 × 171 bp, denoted as 23α), and 8777 bp (~52 × 171 bp, denoted as 52α). These HOR-signature peaks can also be deduced directly from the HOR structure in Fig. 4 and Supplementary Table 5.

For example, the 8777 bp (52α) HOR-signature peak arises from the approximate repeat of the 1_2-4_2 subsequence (monomers 1 to 4 of HOR copy No. 2) at the position of the 1_4-4_4 subsequence (the 1_3-4_3 subsequence is missing due to riddling) (Table VIII). The distance between the corresponding bases in these two subsequences is equal to the distance between monomers 1_2 and 1_4.



TABLE VIII Contributions to fragment length 8777 bp alphoid GRM HOR-signature peak for chimpanzee Y chromosome

Length (bp) | Distance
1868        | 1_2 - 11_2
2683        | 13_2 - 28_2
1198        | 5_3 - 11_3
3028        | 13_3 - 30_3






Similarly, we interpret all the other pronounced HOR-signature peaks in Fig. 6b. The frequencies of these peaks are sizably smaller than those of peaks arising from some other tandem repeats, and they are therefore overshadowed in Fig. 5 for the whole chimpanzee Y chromosome. We note that the HOR-signature peaks at 3884, 4895, 5066, and 8777 bp are the only significant GRM peaks above 1.5 kb in Fig. 6b.

FIG. 5 GRM diagram for the Build 2.1 genomic assembly of chimpanzee chromosome Y for the following intervals of fragment lengths: a 0-1500 bp. There is only one pronounced tandem array with repeat units in the interval between 0.1 and 1.5 kb: the alphoid tandem repeat with the alpha satellite repeat unit of 171 bp. The peaks at multiples of the alphoid monomer repeat unit 171 bp, n·171 bp, are denoted by nα. b 0-80000 bp. Pronounced peaks above 2 kb are denoted by the corresponding fragment lengths. The most pronounced peaks above 1.5 kb are approximately at 2383, 5096, 10762, 10853, 21218, 23578, 32071, 60523, 64624, and 72140 bp. For description of peaks see the text

Some peaks from the GRM diagram for the whole chromosome Y (Fig. 5) are missing in the GRM diagram for the HOR section in Fig. 6a. For example, the peak at 551 bp from Fig. 5a is missing in Fig. 6a, because the repeat unit of 551 bp is positioned outside of the HOR section of the genomic sequence included in Fig. 6a.

In addition to the equidistant multiple alphoid peaks, in the GRM diagram in Fig. 6a there is a family of weaker equidistant peaks at fragment lengths 118, 118 + α, 118 + 2α, 118 + 3α, ... (as in Fig. 5, here α, 2α, 3α, ... denote multiples of the alpha monomer length ~171 bp). This weak equidistant family of repeat lengths is based on the 118 bp peak. The origin of this peak is that one of the monomers within the HOR, m25, is truncated, with its size reduced from the standard value of ~171 bp to 118 bp. (Observe that we find an analogous appearance of additional bands based on monomers of irregular length, 104 and 224 bp, for two human monomers in the 35mer alphoid HOR in the interior part of the HOR array.)



6. Comparison of Alpha Satellite Monomers in Human 45mer 
and Chimpanzee 30mer HORs 

Computing the divergence between the 45 human consensus alpha monomers from the consensus 45mer HOR and the 30 chimpanzee consensus alpha monomers from the consensus 30mer HOR (Supplementary Table 7), we see that, owing to the scatter of divergences and the absence of any small divergence, none of the chimpanzee monomers can be assigned to a particular human monomer (Supplementary Table 8). In the whole human-chimpanzee divergence matrix the lowest divergence value is 12%, appearing in a few cases only (Table IX). The mean value of the lowest human-chimpanzee divergence for each human monomer is 17% (Supplementary Table 8). The absence of identity between particular human and chimpanzee monomers from alphoid HORs is also seen from the mean values of divergences in Table X.

TABLE IX Illustration of divergences of human monomers m01 and m24 with respect to 30 chimpanzee monomers

Human monomer | No. of chimpanzee monomers | Divergence (%)
m01           | Two                        | 21
m01           | Four                       | 22
m01           | Three                      | 23
m24           | Three                      | 12
m24           | Three                      | 13
m24           | One                        | 14

This means, for example, that the lowest divergences between the m01 (human) monomer and each of the 30 chimpanzee monomers are 21% (with respect to two chimpanzee monomers), 22% (with respect to four chimpanzee monomers), etc.

FIG. 6 GRM diagram for the HOR-containing section from positions 1-20019 bp in the chimpanzee contig NW_001252921.1. Intervals of fragment lengths: a 0-1000 bp, b 0-10000 bp. For description of peaks see the text



TABLE X Comparison of mean values of human and chimpanzee consensus monomer divergences

Comparison                      | Divergence (%)
45 Human vs. 45 human           | 19
30 Chimpanzee vs. 30 chimpanzee | 21
45 Human vs. 30 chimpanzee      | 23

45 human denotes the set of consensus alpha monomers from the human 45mer HOR, and 30 chimpanzee the set from the chimpanzee 30mer HOR

On the other hand, we find that the alpha monomers in the 30mer HORs in the chimpanzee Y chromosome are predominantly of the M1 SF type, like the alpha monomers in the 35mer/45mer HORs in human chromosome Y. Accordingly, as for the human Y chromosome, the monomers in the chimpanzee Y chromosome are also characterized by the presence of the pJα motif and the absence of the CENP-B box (Fig. 4). As already noted, the human Y chromosome was the only known case where the pJα motif is present and the CENP-B box absent, and now we see that the chimpanzee Y chromosome shares this feature.

As to the degree of riddling, the human HOR is more 
riddled than the chimpanzee HOR. In particular, the 
human HOR has more insertions than the chimpanzee 
HOR, which is reflected in their respective GRM HOR 
signature. 



7. Peculiarities of Alphoid HOR in Human Y Chromosome

We show that the HOR structure in the peripheral regions of the major alphoid block in human chromosome Y is more complex than the previously reported structure for the internal region. In this computational study, we identify and fully characterize the peripheral region, in particular finding ten new monomers constituting alphoid HOR copies, different from the known 35 constituent monomers, which gives evidence for the presence of a 45mer in the peripheral region of the HOR array. Furthermore, while 33 out of the 35 alphoid monomers constituting the HOR copies in the interior HOR region are highly homologous to the corresponding monomers in the peripheral region, we find that the remaining two monomers in the interior region have a sizeable deletion and a nonalphoid insertion, respectively, with respect to the corresponding monomers from the peripheral region. The study of these riddled HOR copies may be valuable for understanding possible sources of genomic diversity, but also has the potential to provide useful markers for medical, population, and forensic genetic studies, and may give a route for identifying mechanisms of DNA sequence evolution.

Some peculiarities studied in this work regarding the major alphoid HOR that may shed some new light on the mysteries of the human Y chromosome are:

The 33 consensus monomers from the peripheral HOR structure are highly identical to the aligned 33 monomers of the previously reported secondary periodicity sequence from Skaletsky et al. (2003). On the other hand, we find peculiar differences: the 10mer alphoid sequence, inserted in the peripheral HOR structure, is absent in the reported internal structure; and in the previously reported internal secondary periodicity structure one constituent alphoid monomer has a sizeable deletion (67 bp) and the other a sizeable nonalphoid insertion (53 bp) accompanied by clustered substitutions of 11 bases with respect to the peripheral HOR structure.

The highly identical alphoid 10mer insert appears in both peripheral regions of the major HOR, but has not been reported so far in the internal centromere region between the two peripheral regions.

The peripheral regions of the major alphoid HOR block reveal a coexistence of, on one hand, very low divergence between the aligned constituent alpha monomers from different HOR copies (average divergence ~0.3%) and, on the other hand, pronounced riddling due to deletions and insertions of alpha monomers and/or due to insertions of nonalphoid segments. The HOR copies in chromosome Y are the only known case where the pJα motif is present and the CENP-B box absent.

The major alphoid HOR in the Y chromosome exhibits more deletions and insertions of alphoid monomers and highly distorted insertions than HORs in other chromosomes.



8. Difference Between Humans and Chimpanzees Alphoid HOR 
Repeat Units 

The number of different monomers constituting HOR 
in human Y chromosome (45 monomers in the periph- 
eral sections of major HOR array, and 35 monomers in 
the interior section) is different than in the chimpanzee 
genome (30 monomers). 

HOR pattern in the sequenced domain in Build 37.1 
assembly (peripheral region) is characterized by substan- 
tial riddling, which is more pronounced in human than 
in chimpanzee genome. 

All alpha satellite monomers constituting the major human 35/45mer HOR differ from the monomers constituting the chimpanzee 30mer HOR by ~20%, which is comparable to the divergence between monomers within a single HOR copy.

The lengths of the major alphoid HOR arrays in human and chimpanzee are widely different, ~3 and ~1 Mb, respectively.
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B. Other Human and Chimpanzee Tandem, HOR and 
Regularly Dispersed Repeat Arrays Based on Large Repeat Units

Besides the alphoid HOR, in the human Build 37.1 and chimpanzee Build 2.1 Y chromosome assemblies we find over 20 other large repeat units (Tables I, II, and III). Some of the large repeat units appear in both the human and the chimpanzee genomic assembly, and some in human only or in chimpanzee only. We describe here some pronounced repeats identified from GRM diagrams (labeled a in Tables I and II). The remaining repeats (denoted b in Tables I and II) are described in the Supplementary information.

1. Chimpanzee ~550 bp Primary Repeat Unit, ~1652 bp 3mer HOR Secondary Repeat Unit, and ~23578 bp Tertiary Repeat Unit

In the GRM diagram for the chimpanzee Y chromosome in the length interval between 100 and 1500 bp, besides the major peaks associated with the alphoid HOR and the tandem repeat based on the 125 bp repeat unit, there is an additional pronounced peak at ~550 bp (Fig. 5a). Using GRM, we find that this peak arises from 3mer HOR copies constituted from three ~550 bp monomers, denoted mc01, mc02, and mc03. These monomers mutually diverge by ~8%, while different 3mer HOR copies mutually diverge by only ~1%. A divergence between 3mer copies that is about eight times smaller than the divergence between the individual monomers within each 3mer is a signature of a HOR. However, these HOR copies are not in tandem, in contrast to previously known HOR structures; instead, they are dispersed with rather regular spacings. The consensus sequences of the three monomers mc01, mc02, and mc03, determined from NW_001252921.1 (using the key string AGGTACTG), are given in Supplementary Table 9. The main contributions to the ~550 bp GRM peak arise from the array of ~550 bp monomers within each 3mer copy.

Performing the GRM analysis we find 20 dispersed HOR copies (Table XI). In addition, in four HOR copies in NW_001252921.1 one of the three ~550 bp monomers is deleted. In NW_001252921.1, we find dispersed, highly identical 3mer HORs, direct and reverse complement. The HOR copies after the first one are grouped into five pairs of 3mers:

D S D
R S R
D S D
R S R
D S D

where D is the direct 3mer copy, R is the reverse complement 3mer copy, and S is the spacing of ~24 kb (see Table XI). (Three of the 3mer copies in these pairs are truncated from three to two monomers.) Since the two 3mer copies in each pair are separated by the spacing S, there is no GRM peak at ~1.65 kb. Instead, this gives rise to a tertiary repeat unit, with a ~24 kb peak (more precisely ~23578 bp) in the GRM diagram.

TABLE XI Dispersed 3mer HOR copies based on ~550 bp monomer in chimpanzee Y chromosome

Contig         | HOR copy start position | Direction | Monomers in HOR copy
NW_001252921.1 | (illegible)             | RC        | mc03 mc02 mc01
               | 1021470                 | D         | mc01 mc03
               | 1044498                 | D         | mc01 mc02 mc03
               | 1330195                 | RC        | mc03 mc02 mc01
               | 1353792                 | RC        | mc03 mc02 mc01
               | 1757?03                 | D         | mc01 mc02 mc03
               | 1781391                 | D         | mc01 mc02 mc03
               | 2063781                 | RC        | mc03 mc02 mc01
               | 2087364                 | RC        | mc03 mc01
               | 249002?                 | D         | mc01 mc03
               | 2513051                 | D         | mc01 mc02 mc03
               | 2798724                 | RC        | mc03 mc01
NW_001252926.1 | 232825                  | D         | mc01 mc02 mc03
               | 516623                  | RC        | mc03 mc02 mc01
               | 540165                  | RC        | mc03 mc02 mc01
NW_001252919.1 | 328953                  | D         | mc01 mc02 mc03
               | 574371                  | RC        | mc03 mc02 mc01
NW_001252925.1 | 922846                  | D         | mc01 mc02 mc03
               | 1197882                 | RC        | mc03 mc02 mc01
NW_001252915.1 | 955834                  | RC        | mc03 mc02 mc01

RC denotes a HOR copy having reverse complement sequence with respect to the HOR copy defined as direct (D). In a reverse complement HOR copy each monomer is the reverse complement of the direct monomer sequence

We even find an approximate next-higher-order pattern, three copies of a quartic repeat unit:

R S2 D S D S1 R S R S2 D S D S1 R S R S2 D S D S1 R

where S2 is a spacing of ~0.40 Mb, and S1 a spacing of ~0.28 Mb (see Table XI). The length of this unit is ~0.73 Mb. In NW_001252921.1, we find an array of three such quartic repeat units. This would give rise to a GRM peak at ~0.74 Mb fragment length (the computation is performed here only up to 100 kb fragment lengths).

We note that in NW_001252926.1 we find a D S1 R S R subsection of the above pattern.



2. Human ~545 bp Primary Repeat Unit, ~1641 bp 3mer HOR 
Secondary Repeat Unit, and ~23541 bp Tertiary Repeat Unit 

The GRM peak at 545 bp is due to the ~545 bp monomers, organized in dispersed 3mer HOR copies of ~1641 bp (Table XII). The distance between the start positions of two 3mer copies is again ~24 kb, as in the chimpanzee Y chromosome, giving rise to the ~23541 bp peak in the GRM diagram.
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TABLE XII Dispersed 3mer HOR copies based on ~545 bp 
monomer in human Y chromosome

Contig       | HOR copy start position | Direction | Monomers in HOR copy
NT_011903.12 | 76992                   | RC        | m03 m02 m01
             | 100533                  | RC        | m03 m02 m01
             | 365459                  | RC        | m03 m02 m01
             | 609306                  | D         | m01 m02 m03
NT_011875.12 | 9862260                 | D         | m01 m02 m03
             | 9885800                 | D         | m01 m02 m03
             | 9909341                 | D         | m01 m02 m03
NT_086998.1  | 185824                  | D         | m01 m03

RC denotes a HOR copy having reverse complement sequence with respect to the HOR copy defined as direct (D). In a reverse complement HOR copy each monomer is the reverse complement of the direct monomer sequence
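
The ~23541 bp tertiary repeat unit can be read off directly from the start positions in Table XII as the distance between neighbouring 3mer copies within a contig. The short sketch below does this arithmetic for the two contigs with more than one copy; the variable names and the 100 bp tolerance are illustrative choices.

```python
# Start positions of the 3mer HOR copies, taken from Table XII.
starts = {
    "NT_011903.12": [76992, 100533, 365459, 609306],
    "NT_011875.12": [9862260, 9885800, 9909341],
}


def spacings(positions: list) -> list:
    """Distances between consecutive start positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


all_spacings = {contig: spacings(pos) for contig, pos in starts.items()}

# Spacings close to the GRM peak at ~23541 bp (tolerance is arbitrary).
near_peak = [
    d for values in all_spacings.values() for d in values
    if abs(d - 23541) < 100
]
print(all_spacings)
print(near_peak)
```

The remaining, much larger distances in NT_011903.12 correspond to the large spacings between groups of copies rather than to the tertiary repeat unit.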



The 23541 bp repeat unit corresponds to the previously reported 23.6 kb repeat units containing RMBY genes, but previously it was not related to the ~545 bp PRU (Skaletsky et al. 2003; Warburton et al. 2008).

As seen, the human HOR pattern in the sequenced Y chromosome contains fewer copies than the chimpanzee pattern and is less symmetrically organized. The human ~545 bp monomers (denoted m01, m02, m03) are similar to the chimpanzee ~550 bp monomers (denoted mc01, mc02, mc03): the divergence between the human monomers m01, m02, and m03 and the corresponding chimpanzee monomers is ~4%, while the divergence between off-diagonal monomers (i.e., m01 vs. mc02, m01 vs. mc03, ...) is ~8%. Only a small subsection of ~24 kb encompassing each human HOR copy is similar to the corresponding section encompassing each chimpanzee HOR copy (divergence less than 10%), while the remaining part of the large spacings, of total length ~2 Mb, strongly diverges between human and chimpanzee. This gives a substantial contribution to the overall human-chimpanzee divergence. Furthermore, the subsequences of the ~24 kb human sequence are scattered in various parts of the chimpanzee Y chromosome.



3. Human ~2385 bp Primary Repeat Unit and ~7155 bp 3mer 
HOR Secondary Repeat Unit 

The DAZ gene family, located in the AZFc region of the
Y chromosome, is organized into two clusters and contains
a variable number of copies (Fernandes et al. 2006;
Glaser et al. 1998; Saxena et al. 2000; Seboun et al. 1997).



A ~2.4 kb repeat unit in DAZ genes was reported by
Skaletsky et al. (2003) and Warburton et al. (2008).
Accordingly, the GRM peak at 2385 bp (Fig. 2) is due to
tandem repeats with ~2.4 kb PRU in DAZ genes. Human
DAZ repetitions are located in contig NT_011903.12
(positions 1346649 to 1361029, 1425263 to 1473290, 2977988
to 2997102, and 3050498 to 3086580), i.e., from position
25.3 to 27 Mb within the human Y chromosome.



Using GRM we classify the assembly of ~2.4 kb
monomers into five monomer families (consensus
sequences in Supplementary Table 10). The average
divergence between monomers of the same family is below
1%, while the average divergence between monomers
from different families is ~11%. The monomer family
with the highest frequency of appearance has consensus
length 2385 bp, which determines the position of the
GRM peak at 2385 bp. This monomer family forms a highly
homologous monomeric tandem repeat, which is present in
the DAZ2 and DAZ4 genes.

We find that the GRM peak at 7155 bp corresponds
to a 3mer HOR composed of three variants of ~2.4 kb
DAZ repeat monomers, denoted m01, m02, and m03 (the
first three consensus sequences from Supplementary
Table 10). Computing the GRM diagram of any of the
7155 bp copies we obtain two pronounced peaks, at ~2.4
and ~4.8 kb, revealing the 3mer character. We find that
these 3mer HOR copies are present in all four DAZ1-DAZ4
genes. Human DAZ genes contain 12 DAZ HOR
copies organized into four tandem arrays (DAZ1-DAZ4).

The ~4757 bp peak in the GRM diagram corresponds to
the 2mer HOR copies arising from the 3mer HOR by deletion
of one monomer from the 7155 bp secondary 3mer HOR
unit. In the GRM diagram of the 4757 bp repeat copies,
we obtain only one pronounced GRM peak, at ~2.4 kb,
showing the 2mer character of the 4757 bp repeat copies.
We find that such 2mer HOR copies are present in all four
DAZ1-DAZ4 genes.
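
The peak pattern described above (two internal peaks for a 3mer copy, one for a 2mer copy) can be summarized in a small Python sketch. It assumes an idealized n-mer HOR copy built from monomers of equal length, so that fragment lengths inside one copy are multiples k x monomer length for k = 1, ..., n-1; the function name is hypothetical and this is not the GRM algorithm itself.

```python
def expected_internal_peaks(n_monomers, monomer_length):
    """Idealized fragment lengths inside one n-mer HOR copy:
    k * monomer_length for k = 1 .. n_monomers - 1."""
    return [k * monomer_length for k in range(1, n_monomers)]

MONOMER = 2385  # bp, consensus length of the most frequent DAZ monomer family

peaks_3mer = expected_internal_peaks(3, MONOMER)  # ~2.4 and ~4.8 kb
peaks_2mer = expected_internal_peaks(2, MONOMER)  # ~2.4 kb only
hor_3mer_length = 3 * MONOMER                     # matches the 7155 bp peak

print(peaks_3mer, peaks_2mer, hor_3mer_length)
```

With real monomers of slightly different lengths, the peaks shift accordingly; for instance, the observed 2mer copies give a peak at ~4757 bp rather than exactly 2 x 2385 bp.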



4. Chimpanzee ~2383 bp Primary Repeat Unit and Absence of 
Tandem of Higher Order Repeats 

The GRM peak at ~2383 bp is due to tandem repeats
with ~2.4 kb repeat unit in DAZ genes in the chimpanzee Y
chromosome. Chimpanzee DAZ repetitions are located in
contigs NW_001252917.1 (positions 1109191 to 1130961
and 1259092 to 1280862) and NW_001252922.1 (positions
997017 to 1028356 and 1070171 to 1099128), that is, at
chromosome positions from ~3.2 to 3.4 Mb and from
~11.2 to 11.3 Mb. Positions of the corresponding
subsequences widely differ in human and chimpanzee
chromosomes. Divergence between human and chimpanzee
consensus sequences is ~5%.

We find that the chimpanzee Y chromosome contains 
3mer and 2mer HOR copies, similar to those for human Y 
chromosome, but with one pronounced distinction: chim- 
panzee DAZ genes contain four DAZ HOR copies, which 
are, unlike the case of human Y chromosome, not orga- 
nized into tandem but into dispersed HOR copies. There- 
fore, there are no GRM peaks corresponding to HORs. 

The presence of tandem of DAZ HOR copies in hu- 
man and absence of such tandem in chimpanzee Y chro- 
mosome provides an interesting evolutionary distinction 
between human and chimpanzee Y chromosomes. 



5. Human ~3579 bp 715mer HOR Unit and 5 bp Primary
Repeat Unit

The GRM peak at ~3579 bp is due to a tandem of
28 repeat copies in NT_025975.2. These copies differ in
length from 3544 to 3589 bp. The length 3579 bp has
the highest frequency and is equal to the consensus length.
Other copy lengths appear due to deletion or insertion
of 5 bp subsequences. Average divergence of copies with
respect to the consensus sequence is ~1%. Due to
differences in lengths of copies, the GRM peak at ~3579 bp
is broadened (Fig. 2).

In the next step, we find a strong peak at the fragment 
length 5 bp in GRM diagram for the 3579 bp consensus 
sequence. A dominant key string for segmentation of 
the 3579 bp consensus sequence into 5 bp fragments is 
ATTCC, which is the consensus sequence of 5 bp primary 
repeat copies. Thus the 3579 bp repeat unit is a 715mer 
HOR based on ATTCC primary consensus repeat unit. 
Here 34% of primary repeat 5 bp copies are equal to 
consensus, 38% differ from consensus by one base, 21% 
by two, 6% by three and 1% by four bases. 
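
The distribution of pentamer copies by number of mismatches can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the sequence is cut into consecutive 5 bp fragments in a fixed frame (whereas GRM segments with key strings), indels are ignored, and the toy sequence below is invented for illustration and is not the 3579 bp consensus.

```python
from collections import Counter

def mismatch_profile(sequence, consensus="ATTCC"):
    """Cut `sequence` into consecutive fragments of len(consensus) bases and
    count fragments by the number of bases differing from the consensus."""
    k = len(consensus)
    counts = Counter()
    for i in range(0, len(sequence) - k + 1, k):
        fragment = sequence[i:i + k]
        counts[sum(a != b for a, b in zip(fragment, consensus))] += 1
    return counts

# Invented toy sequence of five pentamers (illustration only).
toy = "".join(["ATTCC", "ATTCG", "ATTCC", "GTTCA", "ATTCC"])
profile = mismatch_profile(toy)
total = sum(profile.values())
for n_diff in sorted(profile):
    print(n_diff, f"{100 * profile[n_diff] / total:.0f}%")
```

Applied to a real consensus sequence, the same counting would yield percentages of the kind quoted above (copies equal to consensus, differing by one base, by two bases, and so on).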

This 3579 bp HOR corresponds to the previously
reported 3584 bp HOR (Skaletsky et al. 2003).



6. Absence of Chimpanzee HOR Unit Corresponding to Human 
3579 bp 715mer HOR Unit 

In the Build 2.1 assembly for chimpanzee Y chromo- 
some we find no analog of the human 3579 bp 715mer 
HOR unit. 



7. Human ~5607 bp 1123mer HOR Unit and 5 bp Primary 
Repeat Unit 

The 5607 bp peak corresponds to a new HOR, with 
5607 bp SRU (5 bp GGAAT PRU). The main contribu- 
tion to this peak is from contig NT_113819.1. We identify 
a tandem of 11 copies, from position 496682 to 553881 
(Supplementary Table 11) and determine the 5607 bp 
consensus sequence (Supplementary Table 12). 

To investigate the structure of the 5607 bp repeat unit, we
compute the GRM diagram of its consensus sequence.
Using an 8 bp key string ensemble, we obtain a GRM
diagram characterized by a set of GRM peaks at fragment
lengths of 5 bp and its multiples (Supplementary Figure),
revealing the underlying 5 bp PRU. However, the reciprocal
distribution of GRM peaks shows deviation from the
exponential distribution expected due to random mutations
of fragments of multiple orders at KSA recognition
sites. This deviation is due to the fact that the length
of key strings in the ensemble is larger than the repeat
unit. This is shown by computing the GRM diagram
using the 3 bp key string ensemble, shorter than the 5
bp PRU (Supplementary Figure). In that case the reciprocal
distribution of GRM peaks corresponding to the
5607 bp consensus sequence indeed follows an exponential
distribution, as expected.

The 5607 bp HOR consensus unit consists of 1123 
pentamer copies. Out of these copies, 353 are identi- 
cal to GGAAT which is the primary repeat consensus. 
The mean divergence between 5 bp consensus GGAAT 
and pentamer copies that are not identical to consensus 
is ~30%. Differences are mostly due to substitutions. 
There are only a few indels: two copies have 1-base inser- 
tion, one has 2-base insertion, ten have 1-base deletion 
and one has 2-base deletion. 



8. Absence of Chimpanzee HOR Unit Corresponding to the
Human 5607 bp HOR Unit

In the Build 2.1 assembly for chimpanzee Y chromo- 
some we find no repeat unit corresponding to human 5607 
bp HOR unit. 

9. Chimpanzee 10853 bp Primary Repeat Unit and 64624 bp 
Secondary Repeat Unit 

The GRM peak at 10853 bp is due to a tandem in
NW_001252917.1 (eight copies), with repeat unit consensus
length 10853 bp. The 10853 bp consensus sequence
is given in Supplementary Table 15. The third copy in
this tandem is distorted: truncated after the first 6399
bases and followed by a large insertion, so that the total
length of the truncated third copy and the neighboring
insertion amounts to a combined length of 21218 bp. The
structure of the eighth copy is distorted similarly to the
third copy, leading again to a ~21 kb combined length.

Distance between the corresponding bases in neighboring
copies (except those involving the third copy) is
~10853 bp, giving rise to the 10853 bp GRM peak.

Distance between the start of the 6399 bp subsection
of the third copy and the start of the fourth copy is 21218
bp, giving rise to the 21218 bp GRM peak. Distance from
the end of the second copy (which has no counterpart in
the truncated third copy) to the end of the fourth copy
is 10853 + 21218 bp = 32071 bp, giving rise to the 32071
bp GRM peak.

The copies No. 1, 2, and 4-7 are identical up to 1%,
while the copies No. 3 and 8 have similar truncation and
additional insertion. Therefore, the copies No. 1-5 form
a secondary repeat HOR copy of the approximate length
2 x 10853 + 21218 + 2 x 10853 (precise value 64624 bp).
The last three copies in the tandem, No. 6-8, represent the
first three copies belonging to the second 64624 bp HOR
copy.
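
The arithmetic behind the 32071 bp and ~64.6 kb values can be verified with a few lines of Python. This is only a worked check using the lengths quoted above; the variable names are illustrative.

```python
PRU = 10853        # bp, consensus length of the primary repeat unit
DISTORTED = 21218  # bp, truncated third copy plus neighboring insertion

# Copies No. 1-5: two regular copies, the distorted third copy, two regular copies.
layout = [PRU, PRU, DISTORTED, PRU, PRU]

peak_across_distorted = PRU + DISTORTED  # end of copy 2 to end of copy 4
approx_sru = sum(layout)                 # approximate secondary repeat unit
offset_from_precise = approx_sru - 64624 # compared with the precise value

print(peak_across_distorted, approx_sru, offset_from_precise)
```

The approximate sum, 64630 bp, differs from the precise 64624 bp value by only 6 bp, as expected when the actual copies deviate slightly from the consensus length.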

The insertion after the truncated third copy in the
chimpanzee tandem repeat with 10853 bp PRU (of length
21218 - 6399 bp = 14819 bp) is also present in the human
Y chromosome as a tandem of two repeat units (divergence
~4%) in contig NT_011903.12. Because these repetitive
units are mutually reverse complement, the GRM diagram
for the human Y chromosome does not show this peak.



C. Summary of Human-Chimpanzee Divergence Due to
Repeats Based on Large Repeat Units

We determine approximately the number of bases 
which are different in repeat arrays of human and chim- 
panzee Y chromosome using a simple formula: 



$$ d = \sum_i d_i = \sum_i \left[ \min(l_{i,\mathrm{hum}},\, l_{i,\mathrm{chimp}}) \cdot p_i + I_i \right] \tag{1} $$

Here, $l_{i,\mathrm{hum}}$ and $l_{i,\mathrm{chimp}}$ are sums of lengths over all
copies of the ith human and chimpanzee repeat unit,
respectively; $\min(l_{i,\mathrm{hum}}, l_{i,\mathrm{chimp}})$ is the smaller of the two
lengths $l_{i,\mathrm{hum}}$ and $l_{i,\mathrm{chimp}}$; $I_i = |l_{i,\mathrm{hum}} - l_{i,\mathrm{chimp}}|$; and
$p_i$ is the divergence between human and chimpanzee repeat
unit i. In this way, we include contributions to
human-chimpanzee divergence both from substitutions and
indels.

For example, in the case of the alphoid HOR in the Y
chromosome (repeat No. 1 from Tables I, II, and III) we have:
$l_{i,\mathrm{hum}}$ = 3048138 bp, $l_{i,\mathrm{chimp}}$ = 1042459 bp, $I_i$ = 2005679
bp, $p_i$ = 0.20, giving $d_i$ = 2214171 bp (Fig. 7). With
respect to the sequence of the larger alphoid HOR, of
length $l_{i,\mathrm{hum}}$, this corresponds to an approximate
divergence $100 \cdot d_i / l_{i,\mathrm{hum}}$ = 72.6%.
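
The alphoid HOR example can be reproduced numerically. The Python sketch below implements the per-repeat term of the formula as described in the text (substitution part on the shared length plus the length difference counted as indels); the function name `repeat_divergence` is our own and is not taken from the GRM code.

```python
def repeat_divergence(l_hum, l_chimp, p):
    """Number of differing bases for one repeat:
    min(l_hum, l_chimp) * p + |l_hum - l_chimp|."""
    indel = abs(l_hum - l_chimp)
    return min(l_hum, l_chimp) * p + indel

# Values quoted in the text for the major alphoid HOR.
l_hum, l_chimp, p = 3048138, 1042459, 0.20

d_i = repeat_divergence(l_hum, l_chimp, p)
percent_of_human_array = 100 * d_i / l_hum

print(round(d_i), round(percent_of_human_array, 1))
```

Most of the result comes from the length difference (2005679 bp), while the substitution term on the shared 1042459 bp contributes only about 0.2 Mb.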



[Figure 7 labels: l_{i,hum}, l_{i,chimp}, p_i; human, chimpanzee]

FIG. 7 Schematic presentation of applying the formula for
calculation of human-chimpanzee divergence for the case of a large
repeat unit (major alphoid HOR)

Summing over all repeats (i = 1, 2, ...) from Tables II
and III we obtain a summary number of different bases
between human and chimpanzee large repeats: d ≈ 3.4
Mb (3378539 bp). The corresponding divergence with
respect to all repeats from Tables I, II, and III is:

$$ \mathrm{div(rep)} = 100 \cdot \frac{d}{L} \tag{2} $$

where the summary length of all repeats from Tables I,
II, and III is L = 4848892 bp.



Thus, we obtain divergence with respect to repeat
sequences included in Tables I, II, and III:

$$ \mathrm{div(rep)} \approx 70\%. \tag{3} $$

If we smear out divergence over the whole Build sequence
of length $L_{as}$ = 25 Mb, we obtain the overall divergence
with respect to assembly length:

$$ \mathrm{div(Build)} = 100 \cdot \frac{d}{L_{as}} \tag{4} $$

$$ \mathrm{div(Build)} \approx 14\%. \tag{5} $$
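
As a quick numerical check of the two summary percentages, the following Python sketch divides the total number of differing bases by the total repeat length and by the assembly length. The variable names are illustrative, and the assembly length is taken as exactly 25 Mb, which is an approximation stated in the text.

```python
d = 3378539        # bp, total number of differing bases in large repeats
L_rep = 4848892    # bp, summary length of all repeats in the tables
L_as = 25_000_000  # bp, sequenced assembly length (~25 Mb, approximate)

div_rep = 100 * d / L_rep    # divergence with respect to repeat arrays
div_build = 100 * d / L_as   # divergence smeared over the whole assembly

print(f"{div_rep:.1f}% {div_build:.1f}%")
```

The computed values, about 69.7% and 13.5%, round to the ~70% and ~14% quoted in the text.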



This estimate of overall divergence due to repeats 
based on large repeat units should be additionally in- 
creased due to overall estimates of approximately 1-2% 
divergence for nonrepeat sequences. 

Both the human and the chimpanzee Y chromosome
sequences are still incomplete; in the human chromosome
~25 Mb out of a total length of ~59 Mb was sequenced.
Thus, greater contiguity at several genomic regions
is desired to reach more precise conclusions regarding
human-chimpanzee divergence. However, the main body
of results will probably stand, because, in general,
nonsequenced gaps are rich in repeat structures. It should
be noted that a whole-genome comparison of chimpanzee
and human revealed an increased divergence in the
terminal 10 Mb of the corresponding chromosomes, consistent
with a general association between increased divergence
rates and location near the chromosome ends (Mikkelsen
et al. 2005; Pollard et al. 2006a). In general, and in
accordance with Gibbs et al. (2007), it can be expected that
unsequenced regions of repeat elements, which are difficult
to align, might for the whole Y chromosome somewhat
increase the presently estimated divergence of 14% for the
sequenced part. Definitive studies of genome evolution
will require high-quality finished sequences (Mikkelsen
et al. 2005).



An interesting question is how much the observed size- 
able divergence can be generalized to the whole genome. 
In this sense, we have started a systematic study of 
human-chimpanzee divergence due to large repeats in 
other chromosomes. 

We see a tendency that large repeat units in humans
are on average larger and copy numbers greater than
those in chimpanzees. This is in accordance with the
previous observation that microsatellites in humans are on
average longer than those in chimpanzees (Vowles and
Amos 2006).

We identify large repeat units which contribute
substantially to divergence between humans and
chimpanzees. Our results indicate that the alphoid HOR and
most of the characteristic tandem repeats with large repeat
units (some present only in the human and not in the
chimpanzee Y chromosome, or vice versa) have been
created after the human-chimpanzee separation, while
only a smaller number of tandems with large repeat units
(present both in the human and in the chimpanzee Y
chromosome at low mutual divergence) originate from a
common ancestor that predated the human-chimpanzee
separation. This is in accordance with previous observations
in some other chromosomes that alpha satellite subsets
found in great apes and humans are in general not located
on their corresponding homologous chromosomes
(Jorgensen et al. 1992; Warburton et al. 1996); for example,
the alpha satellite subset on human chromosome 5 is a
member of SF 1, while that on the homologous chimpanzee
chromosome belongs to SF 2 (Haaf and Willard 1997, 1998).



It was pointed out that this implies that the
human-chimpanzee sequence divergence has not arisen from a
common ancestral repeat, but instead represents initial
amplification and homogenization of distinct repeats on
homologous chromosomes (nonorthologous evolution).



Haaf and Willard (1997) discussed the propositions for
homogenization of alpha satellites. Homogenization
processes appear to proceed in a localized, short-range fashion
that leads to formation of large domains of sequence
identity (Durfy and Willard 1989; Tyler-Smith and Brown
1987; Warburton and Willard 1990). Genomic turnover
mechanisms (molecular drive; Dover 1982, 1986) must
be at work that spread and homogenize individual variant
repeat units throughout arrays and throughout
populations (Haaf et al. 1995). However, the mechanisms by
which this concerted evolution occurs seem unclear,
although several genomic turnover mechanisms such as
unequal crossing over between repeats of sister chromatids
(Smith 1976), sequence conversion (Baltimore 1981),
sequence transposition (Calos and Miller 1980),
translocation exchange (Krystal et al. 1981), and disproportionate
replication (Hourcade et al. 1973; Lohe and Brutlag 1987;
Spradling 1981) have been observed to be active in
certain genomes.

Previous FISH studies support the conclusion that the
localization of SF 3 alpha satellite is substantially
conserved, while alpha satellite sequences belonging to
families 1 and 2 are not shared by the corresponding
chimpanzee homologs (Archidiacono et al. 1995; D'Aiuto et
al. 1993). Here we find that, although the SF 4, which
is composed of M1 alpha satellite monomers constituting
human and chimpanzee alphoid HORs in Y chromosomes,
is conserved, both the alpha satellite monomers in human
and chimpanzee HORs and the HOR lengths are widely
different.

It was pointed out that it is not known whether
evolutionarily important mutations predominantly occurred
in regulatory sequences or coding regions (Carroll 2003;
King and Wilson 1975; McConkey 2002; McConkey et al.
2000; Olson and Varki 2003). Preliminary data suggested
that gene expression patterns of the human brain might have
evolved rapidly (Caceres et al. 2003; Dorus et al. 2004;
Enard et al. 2002; Uddin et al. 2004).



Comparative genomic analyses strongly indicated that
the marked phenotypic differences between humans and
chimpanzees are likely due more to changes in gene
regulation than to modifications of the genes themselves (King
and Wilson 1975; Pollard et al. 2006a,b; Popesco et al.
2006; Prabhakar et al. 2006). The gene regulatory
evolution hypothesis proposes that the striking differences
between humans and chimpanzees are due to gene
expression: the change of pattern and timing of turning
genes on and off.

Pollard et al. (2006b) identified ~100 bp short
genomic regions that are highly conserved in vertebrates,
but show significantly accelerated substitution rates on
the human lineage relative to chimpanzee (Pollard et al.
2006a,b). Many of these Human Accelerated Regions
(HARs), characterized by dense clusters of nucleotide
substitutions, are associated, in particular, with the
nervous system, reproductive system, and immune system.



Detailed studies have indicated that forces other than
selection for random mutations that increase fitness in
specific functional elements may be at play in strongly
accelerated regions (Pollard et al. 2006a). There is a
possibility that changes in the accelerated regions result
from a combination of multiple evolutionary processes,
perhaps including biased gene conversion and a
selection-based process (Pollard et al. 2006a).



Here, we find another type of accelerated region: for
some repeat arrays we find dramatic evolutionary
acceleration of the repeat pattern, from monomeric arrays in
chimpanzee to HOR organization of repeat arrays in the
human Y chromosome, i.e., the rapid onset of unequal
crossing over in the human lineage. Such a region of
accelerated evolution of HOR pattern will be referred to as
a human accelerated HOR region (HAHOR).

The hallmark of an evolutionary shift of function is a
sudden change in a region of the genome that previously has
been conserved (Pollard et al. 2006b). The function of
sets of genomic regulatory sequences has been previously
compared to electronic microprocessing: they process the
information contained in a set of regulatory elements
into the corresponding pattern of gene expression. It
was noted that one of the basic ways in which the regulatory
genomic features are related to evolutionary processes
is the recruitment of existing regulatory pathways into
context (Gierer 1998; Pires-da Silva and Tautz 2000).
These processes follow the
rules of nonlinear interactions. These, in turn, allow for
sudden or very fast changes resulting from the
accumulation of rapidly succeeding small steps with self-enhancing
features. Furthermore, mechanisms of bifurcation and de
novo pattern formation may lead, for instance, to
strikingly different developments in parts of an initially
near-uniform area. Thus, in general, small causes can result
in big effects (Gierer 2004). Finally we note a possibility
that accelerated large repeat units and HAHORs could
have a functional role of new categories of long-range
regulatory elements (Noonan and McCallion 2010).



IV. CONCLUSION 

In this study, we identify and analyze tandem repeats, 
HORs and regularly dispersed repeats in chimpanzee and 
human. For the first time we report a dozen new large 
repeats in chimpanzee and several new large repeats in 
human genome. Comparing the corresponding repeats 
based on large repeat units in human and chimpanzee we 
find substantial contribution to the human-chimpanzee 
divergence from these repeats, approximately 70% diver- 
gence with respect to repeat arrays based on large re- 
peat units. Smearing out these differences in large re- 
peats over the whole sequenced assemblies, human Build 
37.1 and chimpanzee Build 2.1, i.e., by neglecting diver- 
gence between other segments of genome sequences, we 
obtain an overall human-chimpanzee divergence between 
sequenced assemblies of approximately 14%. This
numerical estimate far exceeds the available earlier numerical
estimates for human-chimpanzee divergence.

Our results are in accordance with the recent publication
by Hughes et al. (2010), where it was shown by overall
comparison that the human and chimpanzee MSYs differ
radically.

We explicitly identify, analyze, and compare a dozen
large repeats which give a substantial contribution to
human-chimpanzee divergence.

We find in humans several HAHORs on the human
lineage relative to chimpanzee, containing HOR structures,
in particular the alphoid HORs, the ~2.4 kb DAZ
repetitions and the ~15.8 kb repetitions. On the other hand,
in the chimpanzee genome we find a chimpanzee-accelerated
HOR region (CAHOR) based on the ~550 bp PRU.



While the HARs discovered previously (Pollard 2009;
Pollard et al. 2006a,b; Popesco et al. 2006; Prabhakar et
al. 2006) were characterized by short dense clusters
of nucleotide substitutions, the HAHORs found in
this work are characterized by higher-order organization
extended over larger genomic stretches.

Our results show explicitly that large repeat units and 
HORs provide substantial contribution to the human- 
chimpanzee divergence. 



V. GRM ANALYSIS 

GRM analysis was performed using novel GRM code, 
which is available upon request. 